High throughput tight binding calculation of electronic HOMO-LUMO gaps and its prediction for natural compounds

10 April 2025, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

This research investigates the prediction of the HOMO-LUMO (HL) gap of natural compounds, a crucial property for understanding molecular electronic behavior relevant to cheminformatics and materials science. To address the computational expense of traditional methods, this study develops a high-throughput, machine learning-based approach. For 407,000 molecules from the COCONUT database, molecular descriptors were calculated and selected with RDKit. The computational workflow, managed by Toil and CWL on a high-performance computing Slurm cluster, used xTB for electronic structure calculations with Boltzmann weighting across multiple conformational states. Gradient boosting regression (GBR) and a multi-layer perceptron regressor (MLPR) were compared on their ability to accurately predict HL-gaps in this chemical space. Key findings reveal molecular polarizability, particularly the SMR_VSA descriptors, as crucial for HL-gap determination in both models. Aromatic rings and functional groups such as ketones also significantly influence the HL-gap prediction. While the MLPR model demonstrated good overall predictive performance, its accuracy varied across molecular subsets. Challenges were observed in predicting HL-gaps for molecules containing aliphatic carboxylic acids, alcohols, and amines in molecular systems with complex electronic structure. This work emphasizes the importance of polarizability and structural features in HL-gap predictive modeling, showcasing the potential of machine learning while also highlighting limitations in handling specific structural motifs. These limitations point towards promising directions for further model improvements.
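The Boltzmann weighting across conformational states mentioned above can be illustrated with a minimal sketch. It assumes that each conformer comes with a relative total energy and an HL-gap, both in eV, and that the ensemble value is the population-weighted mean of the per-conformer gaps at a fixed temperature. The function names, the units, and the default temperature of 298.15 K are assumptions made for illustration; the abstract does not specify the exact scheme used in the workflow.

```python
import math

# Boltzmann constant in eV/K (CODATA value).
K_B_EV_PER_K = 8.617333262e-5


def boltzmann_weights(energies_ev, temperature_k=298.15):
    """Return normalized Boltzmann populations for a list of conformer energies.

    Energies are shifted by the minimum before exponentiation so that the
    lowest-energy conformer has factor 1.0, which avoids numerical underflow.
    """
    if not energies_ev:
        raise ValueError("at least one conformer energy is required")
    e_min = min(energies_ev)
    kt = K_B_EV_PER_K * temperature_k
    factors = [math.exp(-(e - e_min) / kt) for e in energies_ev]
    partition_sum = sum(factors)
    return [f / partition_sum for f in factors]


def boltzmann_weighted_gap(energies_ev, gaps_ev, temperature_k=298.15):
    """Return the population-weighted mean HL-gap over all conformers (in eV)."""
    if len(energies_ev) != len(gaps_ev):
        raise ValueError("energies and gaps must have the same length")
    weights = boltzmann_weights(energies_ev, temperature_k)
    return sum(w * g for w, g in zip(weights, gaps_ev))


if __name__ == "__main__":
    # Hypothetical values for three conformers of one molecule.
    energies = [0.000, 0.020, 0.060]   # relative total energies in eV
    gaps = [3.10, 2.95, 3.40]          # per-conformer HL-gaps in eV
    print(boltzmann_weighted_gap(energies, gaps))
```

In this sketch, conformers with equal energies contribute equally, while a conformer lying far above the minimum (many multiples of kT) contributes almost nothing to the weighted gap.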

Keywords

workflow
CWL
tight binding
band gap
cheminformatics
machine learning
regression
data science
interoperability
reusability

Supplementary materials

Supporting information: parameters for the ML model, additional plots
Outlier molecules: collection of outlier molecules with error >0.8 eV in predicting the HOMO-LUMO gap
