Abstract
The discovery of novel, high-performing catalysts is essential to the economical development and deployment of many promising materials and fuels. Machine learning (ML) methods have proven useful for accelerating catalyst discovery, but often yield models restricted to specific chemical domains or types of structures. To obtain generalizable ML models, large and diverse datasets are needed, and these exist mostly for heterogeneous catalysis. The tmQM dataset, which contains 86,665 transition metal complexes and their properties calculated at the TPSSh/def2-SVP level of density functional theory, provided a promising dataset for training a generalizable ML model on homogeneous catalyst systems. We trained several ML models on tmQM and found that these models consistently underpredicted the energies of a chemically distinct subset of the data. To address this, we present the tmQM_rev dataset, which removes structures in tmQM whose molecular geometries were found to be missing hydrogen atoms and recomputes the energies of all remaining structures at the ωB97M-V/def2-SVPD level of density functional theory. ML models trained on tmQM_rev show no pattern of consistently incorrect predictions and much lower errors than those trained on tmQM. With respect to test set MAE, EwT, and parity, the ML models tested on tmQM_rev ranked, from best to worst, GemNet-T > PaiNN ~ SpinConv > SchNet. For all models, performance improved when using only neutral structures instead of the entire dataset, which also contains charged species. However, models trained on only neutral structures appeared to saturate, while those trained on the entire dataset did not. Learning curves indicate that the models capture the chemical diversity of the neutral species in tmQM_rev. Additional data improves the models only when charged species are included, indicating the importance of accurately capturing a range of oxidation states in future data generation and model development.
Furthermore, a fine-tuning approach where model weights were initialized from models trained on OC20 led to drastic improvements in model performance. These results indicate transferability between ML strategies of heterogeneous and homogeneous systems.
Supplementary materials
Title
Supporting Information
Description
Figures describing the energy distributions of the datasets used, the atomic energies used for reference correction, the MAE and EwT for all models trained on tmQM and on tmQM_rev, learning curves for models trained on tmQM, and test set parity plots for all models trained on tmQM and on tmQM_rev.
Supplementary weblinks
Title
Accompanying Code
Description
GitHub repository containing code used in this work. Contains: ASE Atoms representations of the removed structures, tmQM, and tmQM_rev; configuration files for all models trained; guides on how to use the supporting code; fine-tuning configurations, checkpoints, and predictions; predictions of each model on its respective test set; energies used for reference correction; scripts used for processing; trained checkpoints for models presented; and the data splits used to train models in this work.
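The repository's "energies used for reference correction" refer to a standard preprocessing step when training ML models on DFT total energies: subtracting a sum of per-element reference energies so the model learns a smaller, better-conditioned target. A minimal sketch of that step is shown below; the `ATOMIC_REF` values are hypothetical placeholders, not the actual values used in this work.

```python
from collections import Counter

# Hypothetical per-element reference energies in Hartree; the real values
# would be taken from the repository's reference-correction data.
ATOMIC_REF = {"H": -0.5, "C": -37.8, "N": -54.6, "O": -75.1}

def reference_corrected_energy(symbols, total_energy):
    """Return the DFT total energy minus the summed atomic reference energies."""
    counts = Counter(symbols)
    return total_energy - sum(ATOMIC_REF[el] * n for el, n in counts.items())

# Illustrative example: a CH4-like composition with a made-up total energy.
corrected = reference_corrected_energy(["C", "H", "H", "H", "H"], -40.5)
print(corrected)  # -40.5 - (-37.8 + 4 * -0.5) = -0.7
```

The corrected energy (here on the order of a bond/formation energy rather than a total electronic energy) is what the models regress against; predictions are shifted back by the same offsets at inference time.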