Abstract
Transferability, especially in the context of model generalization, is a paradigm underlying all scientific disciplines. However, the rapid advancement of machine-learned models threatens this paradigm, as it can be difficult to understand how transferability is embedded in (or missing from) complex models. While transferability in machine learning for chemistry should generally benefit from diverse training data, a rigorous understanding of transferability and its interplay with chemical representation remains an open problem. We introduce a transferability framework and apply it to a controllable data-driven model for developing density functional approximations (DFAs), an indispensable tool in everyday chemistry research. We reveal that human intuition introduces chemical biases that can hamper the transferability of data-driven DFAs, and we identify strategies for eliminating them. We then show that uncritical use of large training sets can actually hinder DFA transferability, contradicting typical “more is more” expectations. Finally, our transferability framework yields transferable diversity, a cornerstone principle for curating data to develop general-purpose machine learning models in chemistry.