Abstract
Predictive models of solubility can accelerate solvent selection for applications ranging from electrochemical conversion of organics to pharmaceutical drug development. Herein, we report on the development of a machine learning (ML) workflow for identifying organic co-solvents to increase the concentration of hydrophobic molecules in aqueous mixtures. This task is of particular interest for the electrocatalytic conversion of biomass and bio-oils into sustainable fuels, which faces challenges due to the low aqueous solubility of the feedstock. First, we predict the miscibility of potential co-solvents in water and retain only those that are miscible. Second, we rank the retained co-solvents by the predicted solubility of the molecule of interest in each of them. To achieve this, we train two separate ML models: one using the AqSolDB dataset to predict aqueous solubility, and another using the BigSolDB dataset to predict solubility in organic solvents. After comparing different ML models and features, we select the Light Gradient Boosting Machine (LGBM) model architecture for both aqueous solubility (test R² = 0.864, RMSE = 0.851 for log(S / (mol/dm³))) and organic solubility (test R² = 0.805, RMSE = 0.511 for log(x)) predictions. We examine the generalizability of the organic solubility model on unseen solutes both quantitatively and qualitatively. We evaluate the utility of this ML workflow by identifying co-solvents for benzaldehyde and limonene, two hydrophobic molecules that are relevant for sustainable fuel production, and validate our predictions via experimental solubility measurements.
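The two-step selection described above (filter by water miscibility, then rank by predicted solubility) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `rank_cosolvents`, `is_water_miscible`, and `predict_log_x` are hypothetical names, the two callables stand in for the trained models, and the toy labels and numbers are invented placeholders rather than results from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def rank_cosolvents(
    solute: str,
    candidates: Sequence[str],
    is_water_miscible: Callable[[str], bool],
    predict_log_x: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Keep water-miscible candidates, then rank them by predicted log(x)."""
    # Step 1: discard co-solvents predicted to be immiscible with water.
    miscible = [s for s in candidates if is_water_miscible(s)]
    # Step 2: score each remaining co-solvent by the predicted solubility
    # of the solute in it, and sort from highest to lowest.
    scored = [(s, predict_log_x(solute, s)) for s in miscible]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder stand-ins for the trained models (assumption: in practice these
# would wrap the fitted miscibility and organic-solubility predictors).
toy_miscible = {"solvent_A": True, "solvent_B": False, "solvent_C": True}
toy_log_x = {"solvent_A": -1.2, "solvent_B": -0.3, "solvent_C": -0.7}

ranking = rank_cosolvents(
    solute="solute_X",
    candidates=list(toy_miscible),
    is_water_miscible=lambda s: toy_miscible[s],
    predict_log_x=lambda solute, s: toy_log_x[s],
)
print(ranking)
```

In this toy run, `solvent_B` has the highest placeholder solubility but is dropped in the first step because it is flagged as immiscible, which is the point of applying the miscibility filter before ranking.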
Supplementary materials
Title
Supplementary Information
Description
The supplementary information includes additional details about feature selection, data preprocessing, model training and validation, and experimental solubility estimation. It also contains tables and figures supplementary to the manuscript.