Applying Large Graph Neural Networks to Predict Transition Metal Complex Energies Using the tmQM_wB97MV Dataset

09 November 2023, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Machine learning (ML) methods have shown promise for discovering novel catalysts, but are often restricted to specific chemical domains. Generalizable ML models require large and diverse training datasets, which exist for heterogeneous catalysis but not for homogeneous catalysis. The tmQM dataset, which contains properties of 86,665 transition metal complexes calculated at the TPSSh/def2-SVP level of density functional theory (DFT), provided a promising training dataset for homogeneous catalyst systems. However, we find that ML models trained on tmQM consistently underpredict the energies of a chemically distinct subset of the data. To address this, we present the tmQM_wB97MV dataset, which filters out several structures in tmQM found to be missing hydrogens and recomputes the energies of all other structures at the wB97M-V/def2-SVPD level of DFT. ML models trained on tmQM_wB97MV show no pattern of consistently incorrect predictions and much lower errors than those trained on tmQM. The ML models tested on tmQM_wB97MV were, from best to worst, GemNet-T > PaiNN ~ SpinConv > SchNet. Performance consistently improves when using only neutral structures instead of the entire dataset. However, while models saturate with only neutral structures, more data continues to improve the models when including charged species, indicating the importance of accurately capturing a range of oxidation states in future data generation and model development. Furthermore, a fine-tuning approach where weights were initialized from models trained on OC20 led to drastic improvements in model performance, indicating transferability between ML strategies of heterogeneous and homogeneous systems.
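
The model rankings above rest on error metrics. The supporting information reports two of them for every model: the mean absolute error (MAE) and the energy within threshold (EwT). As a minimal sketch of how such metrics could be computed from predicted and reference energies, assuming energies in eV and an EwT threshold of 0.02 eV (a value often used with OC20-style benchmarks, and an assumption here rather than something stated in the abstract), the function names below are illustrative only:

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error between predicted and reference energies."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))

def ewt(pred, target, threshold=0.02):
    """Fraction of predictions whose absolute error is below `threshold`.

    The default threshold (0.02, assumed to be in eV) is an assumption,
    not a value taken from the abstract.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target) < threshold))

# Hypothetical energies, for illustration only (not data from the paper).
predicted = [1.00, 2.00]
reference = [1.01, 2.50]
print(mae(predicted, reference), ewt(predicted, reference))
```

A lower MAE and a higher EwT both indicate better agreement with the DFT reference energies. EwT is the stricter of the two, since a prediction contributes only if it falls inside the threshold.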

Supplementary materials

Supporting Information
Figures describing the energy distributions of the datasets used, the atomic energies used for reference correction, the MAE and EwT for all models trained on tmQM and on tmQM_wB97MV, learning curves for models trained on tmQM, test set parity plots for all models trained on tmQM and on tmQM_wB97MV, and parity plots showing the effects of removed structures.
