Informative Training Data for Efficient Property Prediction in Metal-Organic Frameworks by Active Learning

01 December 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

In recent data-driven approaches to materials discovery, scenarios where target quantities are expensive to compute or measure are often overlooked. In such cases, it becomes imperative to construct a training set that includes the most diverse, representative, and informative samples. Here, a novel regression tree-based active learning algorithm is employed for such a purpose. It is applied to predict band gap and adsorption properties of metal-organic frameworks (MOFs), a novel class of materials that results from the virtually infinite combinations of their building units. Simpler, low-dimensional descriptors, such as the Stoichiometric-120 and geometric properties, found here to better represent MOFs in the low-data regime, are used to compute the feature space for this model. The partition given by a regression tree constructed on the labeled part of the dataset is used to select new samples to be added to the training set, thereby limiting its size while maximizing the prediction quality. Through tests on the QMOF, hMOF, and dMOF data sets, we show that our method is effective in constructing small training data sets for learning regression models that predict the target properties well, thus reducing the labeling cost. Specifically, our active learning approach is highly beneficial when labels are unevenly distributed in the descriptor space and when the label distribution is imbalanced, which is often the case for real-world data. This offers a unique tool to efficiently analyze complex structure-property relationships in materials and accelerate materials discovery.
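The selection step described in the abstract (fit a regression tree on the labeled samples, then use its partition of the descriptor space to decide which unlabeled samples to label next) can be illustrated with a minimal sketch. This is not the authors' implementation. The names build_tree, find_leaf, and select_batch are hypothetical, and the allocation rule used here (each leaf receives a share of the batch proportional to its label variance times the number of pool points it contains, with random picks inside each leaf) is an assumption made for illustration only.

```python
import random
from statistics import pvariance

def build_tree(X, y, depth=0, max_depth=2, min_leaf=2):
    """Greedy regression tree as a nested dict; each leaf keeps its labels."""
    best = None
    if depth < max_depth and len(y) >= 2 * min_leaf:
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X})[1:]:  # candidate thresholds
                left = [i for i, row in enumerate(X) if row[j] < t]
                right = [i for i, row in enumerate(X) if row[j] >= t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                # Total squared error of the two children after this split.
                sse = sum(len(idx) * pvariance([y[i] for i in idx])
                          for idx in (left, right))
                if best is None or sse < best[0]:
                    best = (sse, j, t, left, right)
    if best is None:
        return {"leaf": True, "labels": list(y)}
    _, j, t, left, right = best
    children = [build_tree([X[i] for i in idx], [y[i] for i in idx],
                           depth + 1, max_depth, min_leaf)
                for idx in (left, right)]
    return {"leaf": False, "feature": j, "threshold": t,
            "left": children[0], "right": children[1]}

def find_leaf(tree, x):
    """Follow the splits until the leaf containing x is reached."""
    while not tree["leaf"]:
        side = "left" if x[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[side]
    return tree

def select_batch(X_lab, y_lab, X_pool, batch_size, seed=0):
    """Pick pool indices to label next (illustrative allocation rule)."""
    rng = random.Random(seed)
    tree = build_tree(X_lab, y_lab)
    leaves = {}  # id(leaf) -> (leaf, indices of pool points in that leaf)
    for i, x in enumerate(X_pool):
        leaf = find_leaf(tree, x)
        leaves.setdefault(id(leaf), (leaf, []))[1].append(i)
    # Assumed score: label variance in the leaf times its pool count.
    scores = {k: pvariance(leaf["labels"]) * len(idx)
              for k, (leaf, idx) in leaves.items()}
    total = sum(scores.values())
    chosen = []
    for k in sorted(scores, key=scores.get, reverse=True):
        idx = leaves[k][1]
        share = scores[k] / total if total > 0 else 1 / len(scores)
        n = min(len(idx), round(batch_size * share), batch_size - len(chosen))
        chosen += rng.sample(idx, n)
    # Rounding can leave the batch short; fill it with random pool points.
    rest = [i for i in range(len(X_pool)) if i not in chosen]
    chosen += rng.sample(rest, min(len(rest), batch_size - len(chosen)))
    return chosen

# Toy data (made up): labels are flat for x < 1 and noisy for x > 1.
X_lab = [[0.1], [0.3], [0.5], [0.7], [1.1], [1.3], [1.5], [1.7]]
y_lab = [0, 0, 0, 0, 10, 20, 10, 20]
X_pool = [[0.05 * k] for k in range(40)]
chosen = select_batch(X_lab, y_lab, X_pool, batch_size=6)
print(sorted(X_pool[i][0] for i in chosen))
```

On this toy data the leaves covering the flat region have zero label variance, so the whole batch is drawn from the noisy region, which is the behavior one would want when labels are unevenly distributed in the descriptor space. In a full active learning loop the chosen samples would be labeled, moved from the pool to the training set, and the tree refit before the next round.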

Keywords

machine learning
active learning
descriptors
MOFs

Supplementary materials

Title: Supporting Information
Description: Details of descriptors calculation; comparison of RS and KSS for different descriptors for QMOF; distribution of labels and training for dMOF; comparison of descriptors for CH4 adsorption; table of MAE for prediction of CO2 and CH4 using RT-AL.
