Abstract
We present an approach for end-to-end training of machine learning models for structure-property modeling on collections of datasets generated with different DFT functionals and basis sets. This approach overcomes the problem of data inconsistencies when training machine learning models on atomistic data. We rephrase the underlying problem as a multi-task learning scenario and show that conditioning neural network-based models on trainable embedding vectors can effectively account for quantitative differences between methods. This allows joint training on multiple datasets that would otherwise be incompatible and thus circumvents the need for re-computation at a unified level of theory. Numerical experiments demonstrate that training on multiple reference methods enables transfer learning between tasks, yielding lower errors than training on each task separately. Furthermore, we show that this approach can be used for multi-fidelity learning, improving data efficiency at the highest fidelity by an order of magnitude. To test scalability, we train a single model on a joint dataset compiled from 10 disjoint subsets of the MultiXC-QM9 dataset generated with different reference methods. Again, we observe transfer learning effects that reduce model errors by a factor of 2 compared to training on each subset alone.
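The core idea of conditioning a model on trainable method embeddings can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the paper's actual architecture: the class name, layer sizes, and the choice to concatenate the embedding with a feature vector are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MethodConditionedMLP(nn.Module):
    """Toy property model conditioned on a trainable per-method embedding.

    Hypothetical sketch: each reference method (functional/basis-set
    combination) gets its own learned vector, letting one network absorb
    systematic offsets between otherwise incompatible datasets.
    """

    def __init__(self, n_methods: int, feat_dim: int,
                 embed_dim: int = 8, hidden: int = 64):
        super().__init__()
        # One trainable embedding vector per reference method.
        self.method_embedding = nn.Embedding(n_methods, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),  # predicted scalar property, e.g. energy
        )

    def forward(self, features: torch.Tensor,
                method_id: torch.Tensor) -> torch.Tensor:
        # Concatenate atomistic features with the embedding of the
        # method that produced the training label.
        cond = torch.cat([features, self.method_embedding(method_id)], dim=-1)
        return self.net(cond)

# A single model can then be trained jointly on the union of datasets,
# with each sample carrying the ID of its reference method.
model = MethodConditionedMLP(n_methods=10, feat_dim=16)
x = torch.randn(4, 16)            # batch of molecular descriptors (illustrative)
ids = torch.tensor([0, 3, 3, 9])  # reference method of each sample's label
y = model(x, ids)
print(tuple(y.shape))
```

At inference time, selecting a method ID queries the model at that level of theory; the shared network weights carry the transferable physics across tasks.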