Abstract
Data subsampling is an established machine learning pre-processing technique for reducing bias in datasets. However, subsampling can remove crucial information from the data and thereby decrease model performance. Many subsampling strategies have been proposed, and benchmarking is necessary to identify the best strategy for a specific machine learning task. As an alternative, we propose using active machine learning as an autonomous and adaptive data subsampling strategy. We show that active learning-based subsampling can lead to better molecular machine learning performance than both training models on the complete training data and 19 state-of-the-art subsampling strategies. We find that active learning is robust to errors in the data, highlighting the utility of this approach for low-quality datasets. Taken together, we describe a new, adaptive machine learning pre-processing approach and provide novel insights into the behavior and robustness of active machine learning for the molecular sciences.
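To make the idea of active learning as a subsampling strategy concrete, the following is a minimal sketch and not the implementation used in this study. It assumes a binary classification task, a small hand-written logistic regression, uncertainty sampling as the acquisition rule, and a fixed selection budget as the stopping criterion; all of these choices, the toy data, and the names `make_toy_data`, `train_logreg`, `predict_proba`, and `active_subsample` are illustrative assumptions.

```python
import math
import random


def make_toy_data(n=100, seed=1):
    """Hypothetical toy data: two overlapping 2D Gaussian classes."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic regression by batch gradient descent."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def active_subsample(X, y, n_initial=4, n_select=20, seed=0):
    """Return indices of a training subset chosen by uncertainty sampling.

    Assumption: a fixed budget (n_select) decides when to stop; the study
    itself may use a different criterion.
    """
    rng = random.Random(seed)
    pool = list(range(len(X)))
    rng.shuffle(pool)
    selected, pool = pool[:n_initial], pool[n_initial:]
    while len(selected) < n_select and pool:
        # Retrain on the points selected so far.
        w, b = train_logreg([X[i] for i in selected], [y[i] for i in selected])
        # Acquire the pool point whose prediction is closest to 0.5.
        best = min(pool, key=lambda i: abs(predict_proba(w, b, X[i]) - 0.5))
        selected.append(best)
        pool.remove(best)
    return selected


X, y = make_toy_data()
subset = active_subsample(X, y)
print(len(subset), "of", len(X), "training points kept")
```

The returned indices define the subsampled training set; a final model would then be trained on `[X[i] for i in subset]` only. In this sketch the subset is adaptive in the sense that each acquisition depends on the model trained on the points chosen so far, rather than on a fixed rule applied once to the whole dataset.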
Supplementary materials
Title
Supplementary Information
Description
Supplementary Tables and Figures
Supplementary weblinks
Title
GitHub repository
Description
GitHub repository to access the code used in this study