Solvmate - A hybrid physical/ML approach to solvent recommendation leveraging a rank-based problem framework

30 April 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The solubility in a given organic solvent is a key parameter in the synthesis, analysis and chemical processing of an active pharmaceutical ingredient. In this work, we introduce a new tool for organic solvent recommendation that ranks possible solvent choices and requires only the SMILES representations of the solvents and solute involved. We report on three additional innovations. First, a differential/relative approach to solubility prediction is employed, in which solubility is modeled using pairs of measurements with the same solute but different solvents. We show that framing solubility prediction as a solvent-ranking problem improves over a corresponding absolute solubility model across a diverse set of selected features. Second, a novel semiempirical featurization based on extended tight-binding (xtb) is applied to both the solvent and the solute, thereby providing physically meaningful representations of the problem at hand. Third, we provide an open-source implementation of this practical and convenient tool for organic solvent recommendation. Taken together, this work could be of benefit to those working in diverse areas, such as chemical engineering, materials science, or synthesis planning.
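
The relative framing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `make_pairs` and `rank_solvents`, the win-counting ranking rule, and the numeric values are all assumptions made for illustration, and the xtb-based featurization and the learned model are left out. The sketch only shows how absolute measurements for one solute can be turned into solvent pairs, and how pairwise preferences can be aggregated into a ranking.

```python
from collections import defaultdict
from itertools import combinations


def make_pairs(measurements):
    """Turn absolute measurements into relative training pairs.

    measurements: iterable of (solute_smiles, solvent_smiles, log_solubility).
    Returns a list of (solute, solvent_a, solvent_b, label), where label is 1
    if the solute is more soluble in solvent_a than in solvent_b, else 0.
    Only measurements sharing the same solute are paired; ties are skipped.
    """
    by_solute = defaultdict(list)
    for solute, solvent, log_s in measurements:
        by_solute[solute].append((solvent, log_s))

    pairs = []
    for solute, entries in by_solute.items():
        for (solv_a, s_a), (solv_b, s_b) in combinations(entries, 2):
            if s_a == s_b:
                continue
            pairs.append((solute, solv_a, solv_b, int(s_a > s_b)))
    return pairs


def rank_solvents(solute, solvents, prefers):
    """Rank candidate solvents by their number of pairwise wins.

    prefers(solute, a, b) is a hypothetical stand-in for a trained pairwise
    model; it returns True if solvent a is predicted to dissolve the solute
    better than solvent b.
    """
    wins = {s: 0 for s in solvents}
    for a, b in combinations(solvents, 2):
        if prefers(solute, a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(solvents, key=lambda s: wins[s], reverse=True)


# Toy values invented for illustration only; these are NOT measured data.
SOLUTE = "OC(=O)c1ccccc1"
toy = [
    (SOLUTE, "CCO", 0.4),
    (SOLUTE, "O", -1.6),
    (SOLUTE, "CCCCCC", -1.0),
]
pairs = make_pairs(toy)

# Use the toy table itself as a perfect "model" to demonstrate ranking.
lookup = {solvent: value for _, solvent, value in toy}
ranking = rank_solvents(
    SOLUTE, ["O", "CCO", "CCCCCC"], lambda _, a, b: lookup[a] > lookup[b]
)
```

Counting pairwise wins is only one simple way to aggregate preferences into a ranking; a real recommender would replace the lookup-based `prefers` with a model trained on the labeled pairs.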

Keywords

Organic Solubility
Solvent Recommender
Web Application

Supplementary materials

Supplementary Information
The SI contains additional statistics on the datasets used, the xtb configuration, feature importance and parity plots, an error analysis by solvent, and example outputs of the solvent recommender.

Full dataset
The combined dataset of the two public data sources described in the main article. First source: Bradley, Jean-Claude; Guha, Rajarshi; Hooker, Bill; Koch, Steven J.; Lang, Andrew; Neylon, Cameron; et al. (2015). Open Notebook Science Challenge Solubility Dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1514952.v1 Second source: Pillong, M.; Marx, C.; Piechon, P.; Wicker, J. G. P.; Cooper, R. I.; Wagner, T. A publicly available crystallisation data set and its application in machine learning. CrystEngComm 2017, 19 (27), 3737-3745. DOI: 10.1039/c7ce00738h

Dataset with COSMO-RS
Dataset on the subset of solutes for which COSMO-RS features could be computed successfully with the provided QM pipeline (within the timeout of 48 h).
