Data-Driven Prediction of Enantioselectivity for the Sharpless Asymmetric Dihydroxylation: Model Development and Experimental Validation

24 April 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The Sharpless asymmetric dihydroxylation remains a key transformation in chemical synthesis, yet its general success masks unexpected cases of lower selectivity. A chemoinformatic workflow was developed to enable data-driven analysis of the reaction. A database of 1007 reactions employing AD-mix α and β was curated from the literature, and an alignment-dependent, fragment-based featurization of alkenes was implemented for modeling. This platform yielded machine learning models capable of predicting the magnitude of enantioselectivity for multiple alkene classes, achieving Q²F3 values ≥ 0.8, test r² values ≥ 0.7, and mean absolute errors (MAE) ≤ 0.3 kcal/mol. The alkene features contributing to model performance were assessed with SHapley Additive exPlanations (SHAP) analysis to gain insight into the factors underlying the predictions. Experimental validation demonstrated that the models could make meaningful predictions for numerous out-of-sample alkenes.
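
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how they are commonly computed; it is not taken from the manuscript. It assumes that enantioselectivity is modeled as a free-energy difference, ΔΔG‡ = RT ln[(1 + ee)/(1 − ee)], which is consistent with an MAE reported in kcal/mol but is not stated in the abstract. The default temperature of 273.15 K and the function names `ee_to_ddg`, `q2_f3`, and `mae` are illustrative assumptions. Q²F3 follows its standard definition: one minus the ratio of the mean squared test-set prediction error to the mean squared deviation of the training-set responses from their mean.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ee_to_ddg(ee_percent, temp_k=273.15):
    """Convert % ee to ddG (kcal/mol) via RT*ln(er).

    Assumption: temp_k defaults to 0 degrees C; the abstract does not
    state the temperature or the exact target variable.
    """
    ee = ee_percent / 100.0
    if not 0.0 <= ee < 1.0:
        raise ValueError("ee must be in [0, 100) percent")
    return R_KCAL * temp_k * math.log((1.0 + ee) / (1.0 - ee))


def q2_f3(y_train, y_test, y_pred):
    """External validation metric Q2_F3 (standard definition)."""
    mean_train = sum(y_train) / len(y_train)
    # Mean squared prediction error on the test set
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred)) / len(y_test)
    # Mean squared deviation of training responses from the training mean
    tss = sum((y - mean_train) ** 2 for y in y_train) / len(y_train)
    return 1.0 - press / tss


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the inputs."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical values for illustration only, not data from the study.
    train = [ee_to_ddg(e) for e in (50, 80, 90, 95, 99)]
    test = [ee_to_ddg(e) for e in (70, 92)]
    pred = [ee_to_ddg(e) for e in (75, 90)]
    print(f"Q2_F3 = {q2_f3(train, test, pred):.3f}")
    print(f"MAE   = {mae(test, pred):.3f} kcal/mol")
```

Because the denominator of Q²F3 depends only on the training set, values computed for different test sets of the same model remain comparable, which is one reason it is often preferred over a test-set r² for external validation.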

Keywords

asymmetric catalysis
dihydroxylation
machine learning
data science
chemoinformatics
