Abstract
This research investigates the prediction of the HOMO-LUMO (HL) gap of natural compounds, a key property for understanding molecular electronic behavior in cheminformatics and materials science. To address the computational expense of traditional methods, the study develops a high-throughput, machine learning-based approach. For 407,000 molecules from the COCONUT database, RDKit was used to calculate and select molecular descriptors. The computational workflow, managed by Toil and CWL on a Slurm high-performance computing cluster, used xTB for electronic structure calculations with Boltzmann weighting across multiple conformational states. Gradient boosting regression (GBR) and a multi-layer perceptron regressor (MLPR) were compared on their ability to predict HL gaps accurately in this chemical space. In both models, molecular polarizability, captured in particular by the SMR_VSA descriptors, emerged as crucial for determining the HL gap. Aromatic rings and functional groups such as ketones also significantly influence the HL-gap prediction. While the MLPR model showed good overall predictive performance, its accuracy varied across molecular subsets. HL gaps proved difficult to predict for molecules containing aliphatic carboxylic acids, alcohols, and amines in molecular systems with complex electronic structure. This work emphasizes the importance of polarizability and structural features in HL-gap predictive modeling, showcasing the potential of machine learning while also highlighting its limitations in handling specific structural motifs. These limitations point to promising directions for further model improvement.
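The Boltzmann weighting across conformational states mentioned above can be illustrated with a minimal sketch. It assumes that each conformer has a relative energy and an HL gap, and that the reported gap is the population-weighted average. The function name, the energy unit (kcal/mol), and the temperature (298.15 K) are assumptions made for illustration and are not taken from the study.

```python
import math

# Assumption: energies in kcal/mol and T = 298.15 K; the study's actual
# units and temperature are not stated in the abstract.
GAS_CONSTANT_KCAL = 0.0019872041  # kcal/(mol*K)
KT_KCAL_PER_MOL = GAS_CONSTANT_KCAL * 298.15


def boltzmann_weighted_gap(energies_kcal, gaps_ev, kt=KT_KCAL_PER_MOL):
    """Average conformer HL gaps using Boltzmann populations (hypothetical helper)."""
    if not energies_kcal or len(energies_kcal) != len(gaps_ev):
        raise ValueError("need one gap per conformer energy, and at least one conformer")
    # Shift by the lowest energy so the largest exponent is exp(0) = 1.
    e_min = min(energies_kcal)
    weights = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    partition_sum = sum(weights)
    return sum(w * g for w, g in zip(weights, gaps_ev)) / partition_sum


if __name__ == "__main__":
    # Three made-up conformers: relative energies (kcal/mol) and gaps (eV).
    print(boltzmann_weighted_gap([0.0, 0.5, 2.0], [3.10, 3.25, 2.90]))
```

Shifting all energies by the minimum leaves the weights unchanged after normalization and avoids underflow when absolute energies are large. Conformers that lie many multiples of kT above the minimum contribute almost nothing to the average.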
Supplementary materials
Supporting information: parameters for the ML model, additional plots
Outlier molecules: collection of outlier molecules with error >0.8 eV in predicting the HOMO-LUMO gap