Abstract
The solubility in a given organic solvent is a key parameter in the synthesis, analysis and chemical processing of an active pharmaceutical ingredient. In this work, we introduce a new tool for organic solvent recommendation that ranks candidate solvents and requires only the SMILES representations of the solvents and the solute involved. We report three innovations. First, a differential (relative) approach to solubility prediction is employed, in which solubility is modeled using pairs of measurements that share the same solute but differ in the solvent. We show that framing solubility prediction as a relative solvent-ranking problem improves over the corresponding absolute solubility model across a diverse set of selected features. Second, a novel semiempirical featurization based on extended tight-binding (xTB) is applied to both the solvent and the solute, thereby providing physically meaningful representations of the problem at hand. Third, we provide an open-source implementation of this practical and convenient tool for organic solvent recommendation. Taken together, this work could benefit those working in diverse areas, such as chemical engineering, materials science, or synthesis planning.
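The relative framing described in the abstract can be illustrated with a minimal sketch. It is not the Solvmate implementation: the function names `make_pairs`, `rank_solvents` and `oracle_diff`, the record layout, and the numeric values are all hypothetical and chosen only for illustration. The sketch assumes solubility is given as a log value per (solute, solvent) record, builds ordered solvent pairs for each solute, and ranks solvents by their summed pairwise differences.

```python
from collections import defaultdict
from itertools import permutations

# Placeholder records: (solute SMILES, solvent SMILES, log solubility).
# The numbers are invented for illustration, not measured solubilities.
records = [
    ("CC(=O)Oc1ccccc1C(=O)O", "CCO", -0.5),
    ("CC(=O)Oc1ccccc1C(=O)O", "CC(C)=O", -0.2),
    ("CC(=O)Oc1ccccc1C(=O)O", "CCCCCC", -3.0),
    ("c1ccccc1", "CCO", -1.0),  # single measurement: yields no pairs
]


def make_pairs(records):
    """Turn absolute measurements into same-solute, different-solvent pairs.

    Each output row is (solute, solvent_a, solvent_b, log_s_a - log_s_b).
    """
    by_solute = defaultdict(list)
    for solute, solvent, log_s in records:
        by_solute[solute].append((solvent, log_s))

    pairs = []
    for solute, measurements in by_solute.items():
        # Ordered pairs, so both (a, b) and (b, a) appear with opposite sign.
        for (solv_a, s_a), (solv_b, s_b) in permutations(measurements, 2):
            pairs.append((solute, solv_a, solv_b, s_a - s_b))
    return pairs


def rank_solvents(solute, solvents, predict_diff):
    """Rank solvents (best first) by summed predicted pairwise differences.

    predict_diff(solute, a, b) stands in for any model that estimates
    log_s(a) - log_s(b) for the given solute.
    """
    scores = {
        a: sum(predict_diff(solute, a, b) for b in solvents if b != a)
        for a in solvents
    }
    return sorted(solvents, key=lambda s: scores[s], reverse=True)


pairs = make_pairs(records)

# Stand-in for a trained model: look the difference up in the pair table.
lookup = {(solute, a, b): diff for solute, a, b, diff in pairs}


def oracle_diff(solute, a, b):
    return lookup[(solute, a, b)]


ranking = rank_solvents(
    "CC(=O)Oc1ccccc1C(=O)O", ["CCO", "CC(C)=O", "CCCCCC"], oracle_diff
)
print(ranking)
```

In a real setting, `oracle_diff` would be replaced by a model trained on the pair table, with features computed from the solute and both solvents. Summing the pairwise differences is only one simple way to turn pairwise predictions into a ranking.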
Supplementary materials
Title
Supplementary Information
Description
The SI contains additional statistics on the datasets used, the xtb configuration, feature importance and parity plots, an error analysis by solvent, and example outputs of the solvent recommender.
Title
Full dataset
Description
The full dataset, combining the two public data sources described in the main article. The first dataset source:
Bradley, Jean-Claude; Guha, Rajarshi; Hooker, Bill; Koch, Steven J.; Lang, Andrew; Neylon, Cameron; et al. (2015). Open Notebook Science Challenge Solubility Dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1514952.v1
And the second dataset source:
Pillong, M.; Marx, C.; Piechon, P.; Wicker, J. G. P.; Cooper, R. I.; Wagner, T. A publicly available crystallisation data set and its application in machine learning. CrystEngComm 2017, 19 (27), 3737-3745. DOI: 10.1039/c7ce00738h
Title
Dataset with COSMO-RS
Description
Dataset for the subset of solutes for which COSMO-RS features could be computed successfully with the provided QM pipeline (within the 48 h timeout).
Supplementary weblinks
Title
Solvmate GitHub repository
Description
Link to the GitHub repository of the "Solvmate" project.