Structure feature vectors derived from Robocrystallographer text descriptions of crystal structures using word embeddings

Hasan M. Sayeed; Sterling G. Baird; Taylor D. Sparks

doi:10.26434/chemrxiv-2023-3q8wj


10 March 2023, Version 1

Working Paper


This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Capturing structure-property relationships of materials for property prediction with machine learning requires representing, or featurizing, the structural aspects of materials at several length scales, including the atomic, crystal, and micro scales. Although crystal structure-based modeling techniques are effective for materials informatics, many materials datasets lack complete structural information. Moreover, when the goal is to discover novel materials, the structure of a candidate compound is not known beforehand. These two cases limit the applicability of structure-based learning techniques. Tools for the automated generation of structural features are also scarce in materials science; most structural descriptions of materials are produced through human analysis and then stored as text in scientific documents. One way to extract a structure-based feature vector from this corpus of knowledge is to use natural language processing to create word embeddings trained on these structural descriptions. Such embeddings could encode the kinds of structures that each element of the periodic table forms, which could help address both situations described above: datasets that lack structural details, and novel materials whose structures are unknown. In this work, we create word embeddings by training models on text descriptions of crystal structures generated automatically by Robocrystallographer, and we assess the utility of these structure-based feature vectors for predicting materials properties by comparing their performance against mat2vec and one-hot encoded features.
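To make the comparison between embedding-based and one-hot encoded features concrete, the following is a minimal sketch of how per-element vectors might be turned into a feature vector for a whole compound. It is not the authors' pipeline. The function names are hypothetical, the three-dimensional "embedding" values are made-up placeholders rather than trained vectors, and composition-weighted averaging is an assumed pooling choice that the abstract does not specify.

```python
import re


def parse_formula(formula):
    """Parse a simple formula such as 'Fe2O3' into {element: amount}.

    Assumption: no parentheses, hydrates, or charges; illustration only.
    """
    counts = {}
    for element, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[element] = counts.get(element, 0.0) + (float(amount) if amount else 1.0)
    return counts


def one_hot_vectors(elements):
    """Build one-hot element vectors over a fixed, ordered element list."""
    size = len(elements)
    return {
        el: [1.0 if i == j else 0.0 for j in range(size)]
        for i, el in enumerate(elements)
    }


def pool_element_vectors(formula, element_vectors):
    """Average element vectors weighted by atomic fraction (assumed pooling)."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    dim = len(next(iter(element_vectors.values())))
    pooled = [0.0] * dim
    for element, amount in counts.items():
        fraction = amount / total
        vector = element_vectors[element]  # KeyError if the element is unknown
        for k in range(dim):
            pooled[k] += fraction * vector[k]
    return pooled


# Placeholder element "embeddings": hypothetical values, NOT trained vectors.
toy_embeddings = {
    "Fe": [0.2, -0.1, 0.5],
    "O": [-0.3, 0.4, 0.1],
}

baseline = pool_element_vectors("Fe2O3", one_hot_vectors(["Fe", "O", "Na"]))
embedded = pool_element_vectors("Fe2O3", toy_embeddings)
print("one-hot features:  ", baseline)
print("embedding features:", embedded)
```

With one-hot element vectors, the pooled result reduces to the fractional composition of the compound, which is why it serves as a structure-agnostic baseline. Swapping in trained element embeddings leaves the pooling step unchanged and only changes what each element contributes.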

Keywords

materials informatics

machine learning

materials science




License

The content is available under CC BY 4.0


Funding

National Science Foundation

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
