Abstract
Capturing the structure-property relationships of materials for machine-learning property prediction requires representing, or featurizing, the structural aspects of materials at different levels, including the atomic, crystal, and microscale levels. While crystal structure-based modeling techniques are effective for materials informatics, many materials datasets lack complete structural information. Moreover, when discovering novel materials, the structure of a candidate compound is not known beforehand. Both cases limit the applicability of structure-based learning techniques. Tools for the automated generation of structural features are also scarce in materials science: most structural descriptions of materials are produced by human analysis and then stored as text entries in scientific documents. One way to extract a structure-based feature vector from this corpus of knowledge is to use natural language processing to create word embeddings trained on these structural descriptions. Such embeddings could encode the kinds of structures that each element of the periodic table forms, which can be leveraged in both situations mentioned above: missing structural details in a dataset and unknown structures of novel materials. In this work, we created word embeddings by training models on text descriptions of crystal structures automatically generated by Robocrystallographer, and we assessed the utility of these structure-based feature vectors for predicting materials properties by comparing their performance against mat2vec and one-hot encoded features.
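To make the comparison between element-level feature vectors concrete, the following is a minimal sketch of how a chemical formula might be turned into a fixed-length feature vector from per-element vectors. It is an illustration only and not the pipeline used in this work: the composition-weighted averaging, the four-element vocabulary `ELEMENTS`, the helper names `parse_formula`, `one_hot`, and `featurize`, and the two-dimensional values in `toy_embeddings` are all assumptions introduced here. The toy values are placeholders, not trained embeddings.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny element vocabulary (assumption for illustration).
ELEMENTS = ["Li", "O", "Fe", "Ti"]


def parse_formula(formula):
    """Map each element in a simple formula (e.g. 'Fe2O3') to its atomic fraction.

    Only flat formulas are handled; parentheses and hydrates are not supported.
    """
    counts = defaultdict(float)
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


def one_hot(element):
    """One-hot vector for an element over the ELEMENTS vocabulary."""
    return [1.0 if el == element else 0.0 for el in ELEMENTS]


def featurize(formula, element_vectors):
    """Composition-weighted average of per-element vectors (assumed pooling scheme)."""
    fractions = parse_formula(formula)
    dim = len(next(iter(element_vectors.values())))
    pooled = [0.0] * dim
    for element, fraction in fractions.items():
        vector = element_vectors[element]
        for i in range(dim):
            pooled[i] += fraction * vector[i]
    return pooled


# Baseline: one-hot element vectors.
one_hot_vectors = {el: one_hot(el) for el in ELEMENTS}

# Placeholder numbers standing in for learned embeddings; NOT real data.
toy_embeddings = {
    "Li": [0.1, 0.9],
    "O": [0.8, 0.2],
    "Fe": [0.5, 0.5],
    "Ti": [0.4, 0.6],
}

if __name__ == "__main__":
    print(featurize("Fe2O3", one_hot_vectors))
    print(featurize("Fe2O3", toy_embeddings))
```

Because `featurize` only needs a dictionary from element symbol to vector, the same downstream model can be trained on one-hot vectors or on any set of learned embeddings, which is what makes a like-for-like comparison of feature sets possible.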