Abstract
Recent data-driven approaches to materials discovery often overlook scenarios in which the target quantities are expensive to compute or measure. In such cases, it becomes imperative to construct a training set that includes the most diverse, representative, and informative samples. Here, a novel regression tree-based active learning algorithm is employed for this purpose. It is applied to predict the band gap and adsorption properties of metal-organic frameworks (MOFs), a novel class of materials that results from the virtually infinite combinations of their building units. Simpler, low-dimensional descriptors, such as Stoichiometric-120 and geometric properties, which we find represent MOFs better in the low-data regime, are used to build the feature space for this model. The partition given by a regression tree constructed on the labeled part of the data set is used to select new samples to add to the training set, thereby limiting its size while maximizing prediction quality. Through tests on the QMOF, hMOF, and dMOF data sets, we show that our method is effective in constructing small training sets from which regression models learn to predict the target properties well, thus reducing the labeling cost. Specifically, our active learning approach is highly beneficial when labels are unevenly distributed in the descriptor space and when the label distribution is imbalanced, which is often the case for real-world data. This offers a unique tool to efficiently analyze complex structure-property relationships in materials and accelerate materials discovery.
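To make the selection step concrete, the following is a minimal sketch of one round of tree-guided sampling, not the authors' implementation. The abstract states only that the partition of a regression tree fitted on the labeled data guides the selection; the specific rule used here (draw from each leaf with probability proportional to its label variance times the number of unlabeled points it contains) is an assumption made for illustration, and the names `build_tree`, `find_leaf`, and `select_batch` are hypothetical. The tree is written with the Python standard library only so that the example is self-contained.

```python
import random
from statistics import pvariance


def build_tree(X, y, idx, depth, min_leaf=2):
    """Greedy variance-reducing regression tree over the labeled points in idx."""
    best = None
    if depth > 0 and len(idx) >= 2 * min_leaf:
        for f in range(len(X[0])):
            values = sorted({X[i][f] for i in idx})
            for lo, hi in zip(values, values[1:]):
                t = (lo + hi) / 2  # candidate threshold between neighbouring values
                left = [i for i in idx if X[i][f] <= t]
                right = [i for i in idx if X[i][f] > t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                # Sum of squared errors of the two children around their means.
                sse = sum(len(s) * pvariance([y[i] for i in s]) for s in (left, right))
                if best is None or sse < best[0]:
                    best = (sse, f, t, left, right)
    if best is None:  # no admissible split: this node becomes a leaf
        return {"leaf": True, "var": pvariance([y[i] for i in idx])}
    _, f, t, left, right = best
    return {"leaf": False, "f": f, "t": t,
            "L": build_tree(X, y, left, depth - 1, min_leaf),
            "R": build_tree(X, y, right, depth - 1, min_leaf)}


def find_leaf(node, x):
    """Follow the splits until the leaf containing descriptor vector x is reached."""
    while not node["leaf"]:
        node = node["L"] if x[node["f"]] <= node["t"] else node["R"]
    return node


def select_batch(X_lab, y_lab, X_pool, batch_size, max_depth=3, seed=0):
    """Return indices into X_pool to label next (assumed variance-weighted rule)."""
    rng = random.Random(seed)
    tree = build_tree(X_lab, y_lab, list(range(len(X_lab))), max_depth)
    # Group the unlabeled pool by the leaf of the tree each point falls into.
    leaves = {}
    for j, x in enumerate(X_pool):
        leaf = find_leaf(tree, x)
        leaves.setdefault(id(leaf), (leaf["var"], []))[1].append(j)
    members = list(leaves.values())
    chosen = []
    while len(chosen) < batch_size and any(pts for _, pts in members):
        # Assumption: favour leaves with high label variance and many unlabeled points.
        weights = [var * len(pts) for var, pts in members]
        if sum(weights) == 0:  # all remaining leaves look flat: weight by size only
            weights = [len(pts) for _, pts in members]
        k = rng.choices(range(len(members)), weights=weights)[0]
        pts = members[k][1]
        chosen.append(pts.pop(rng.randrange(len(pts))))
    return chosen
```

In an active learning loop, the returned pool indices would be labeled (computed or measured), moved into the labeled set, and the tree refitted before the next round. Under the assumed rule, regions of descriptor space where the labels vary strongly receive more of the labeling budget than regions where they are nearly constant.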
Supplementary materials
Title
Supporting Information
Description
Details of descriptor calculation; comparison of RS and KSS for different descriptors for QMOF; distribution of labels and training for dMOF; comparison of descriptors for CH4 adsorption; table of MAE for prediction of CO2 and CH4 using RT-AL.