Structure feature vectors derived from Robocrystallographer text descriptions of crystal structures using word embeddings

10 March 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Capturing structure-property relationships of materials for property prediction using machine learning requires representing, or featurizing, the structural aspects of materials at different levels, including the atomic, crystal, and micro scales. While crystal structure-based modeling techniques are effective for materials informatics, many materials datasets do not have complete structural information. Moreover, when discovering novel materials, the structural information of candidate compounds is not known beforehand. These two cases limit the applicability of structure-based learning techniques. Tools for the automated generation of structural features are also scarce in materials science. Indeed, most structural descriptions of materials are produced by human analysis and then stored as text entries in scientific documents. One way to extract a structure-based feature vector from this corpus of knowledge is to use natural language processing to create word embeddings trained on these structural descriptions. This approach could encode information about the kinds of structures that each element of the periodic table forms, which can be leveraged to tackle both situations mentioned above: missing structural details in a dataset and unknown structural information for novel materials. In this work, we created word embeddings by training models on text descriptions of crystal structures automatically generated by Robocrystallographer, and assessed the utility of these structure-based feature vectors in predicting materials properties by comparing their performance against mat2vec and one-hot encoded features.
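As a rough illustration of the comparison described above, the sketch below builds two composition-level feature vectors for a chemical formula: a fractional one-hot encoding and a composition-weighted average of per-element embedding vectors. It is a minimal sketch under stated assumptions, not the method used in this work. The abstract does not say how element vectors are pooled, so the weighted average is an assumption, and the names TOY_EMBEDDINGS, parse_formula, one_hot_features, and embedding_features are hypothetical. The embedding values are placeholders, not output from any trained model.

```python
import re
from collections import Counter

# Hypothetical 3-dimensional element embeddings. These are placeholder
# values for illustration only, not vectors from any trained model.
TOY_EMBEDDINGS = {
    "Fe": [1.0, 0.0, 0.5],
    "O": [0.0, 1.0, 0.5],
    "Ti": [0.5, 0.5, 0.0],
}
ELEMENTS = sorted(TOY_EMBEDDINGS)  # fixed column order for one-hot features


def parse_formula(formula):
    """Parse a flat formula such as 'Fe2O3' into atomic fractions.

    Parentheses and nested groups are not supported in this sketch.
    """
    counts = Counter()
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] += float(amount) if amount else 1.0
    if not counts:
        raise ValueError(f"no elements found in {formula!r}")
    total = sum(counts.values())
    return {el: n / total for el, n in counts.items()}


def one_hot_features(fractions):
    """Fractional one-hot encoding: one column per known element."""
    return [fractions.get(el, 0.0) for el in ELEMENTS]


def embedding_features(fractions):
    """Composition-weighted average of element embeddings (assumed pooling).

    Raises KeyError for elements without an embedding.
    """
    dim = len(next(iter(TOY_EMBEDDINGS.values())))
    pooled = [0.0] * dim
    for el, frac in fractions.items():
        for i, value in enumerate(TOY_EMBEDDINGS[el]):
            pooled[i] += frac * value
    return pooled


fractions = parse_formula("Fe2O3")
print(one_hot_features(fractions))
print(embedding_features(fractions))
```

The one-hot vector grows with the number of elements considered and treats every element as unrelated to every other, whereas the pooled embedding has a fixed dimension and can place elements that form similar structures close together. Either vector could then be passed to a standard regression model for property prediction.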

Keywords

materials informatics
machine learning
materials science
