Abstract
Machine learning-based methods are now widely used in chemical tasks, particularly in drug design. Graph Convolutional Neural Networks (GCNNs) compete with one another in predicting chemical properties, achieving errors comparable to those of experimental measurements. However, the growing complexity of input data structures and the trend toward using three-dimensional molecular geometries are rarely accompanied by a thorough search for accurate input conformations. In this study, we examined the stability of a state-of-the-art GCNN architecture for drug discovery and identified vulnerabilities related to the structural features of the compounds. We found that molecular weight significantly influenced the discrepancy between predicted and calculated HOMO-LUMO gap values. We also demonstrated that high similarity between new molecules and the training dataset, as measured by Tanimoto indices, did not guarantee high-quality predictions from the model. In contrast, more dissimilar structures required less additional information in the training set for a successful active learning procedure.
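For readers unfamiliar with the similarity measure mentioned above, the following is a minimal sketch of the Tanimoto index between two molecular fingerprints, each represented as a set of "on" bit positions. The function names, the set-based fingerprint representation, and the choice to compare a new molecule against its nearest training neighbor are illustrative assumptions, not the procedure used in the paper.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) index between two sets of 'on' fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Assumed convention for two empty fingerprints: report zero similarity.
        return 0.0
    return len(a & b) / union


def max_similarity_to_training(query, training):
    """Highest Tanimoto index between a query fingerprint and any training fingerprint.

    Hypothetical helper: one common way to summarize how close a new
    molecule is to a training set.
    """
    return max(tanimoto(query, fp) for fp in training)


if __name__ == "__main__":
    # Toy fingerprints (bit positions), not real molecular data.
    training_set = [{1, 2}, {1, 2, 3, 5}]
    new_molecule = {1, 2, 3}
    print(max_similarity_to_training(new_molecule, training_set))
```

The index is the number of shared bits divided by the number of bits set in either fingerprint, so it ranges from 0 (no overlap) to 1 (identical fingerprints).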
Supplementary materials
Title
Supplementary information for the paper "Identifying Potential Missteps of Machine Learning in Molecular Chemistry"
Description
Supplementary information for the paper "Identifying Potential Missteps of Machine Learning in Molecular Chemistry", with additional figures.