Abstract
Transferability, especially in the context of model generalization, is a paradigm underlying all scientific disciplines. However, the rapid advancement of machine-learned models threatens this paradigm, as it can be difficult to understand how transferability is embedded in (or missing from) complex models. While transferability in machine learning for chemistry should generally benefit from diverse training data, a rigorous understanding of transferability and its interplay with chemical representation remains an open problem. We introduce a transferability framework and apply it to a controllable data-driven model for developing density functional approximations (DFAs), an indispensable tool in everyday chemistry research. We reveal that human intuition introduces chemical biases that can hamper the transferability of data-driven DFAs, and we identify strategies for eliminating them. We then show that uncritical use of large training sets can actually hinder DFA transferability, contradicting typical “more is more” expectations. Finally, our transferability framework yields transferable diversity, a cornerstone principle for curating data to develop general-purpose machine learning models in chemistry.