Predicting and Explaining Yields with Machine Learning for Carboxylated Azoles and Beyond

16 December 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Carbon dioxide can be transformed into valuable chemical building blocks, including C2-carboxylated 1,3-azoles, which have potential applications in pharmaceuticals, cosmetics, and pesticides. However, only a small fraction of the millions of available 1,3-azoles are carboxylated at the C2 position, highlighting significant opportunities for further research in the synthesis and application of these compounds. In this study, we utilized a supervised machine learning approach to predict reaction yields for a dataset of amide-coupled C2-carboxylated 1,3-azoles. To facilitate molecular design, we integrated an interpretable heat-mapping algorithm named PIXIE (Predictive Insights and Xplainability for Informed chemical space Exploration). PIXIE visualizes the influence of molecular substructures on predicted yields by leveraging fingerprint bit importances, providing synthetic chemists with a powerful tool for the rational design of molecules. While heat mapping is an established technique, its integration with a machine-learning model tailored to the chemical space of C2-carboxylated 1,3-azoles represents a significant advancement. This approach not only enables targeted exploration of this underrepresented chemical space, fostering the discovery of new bioactive compounds, but also demonstrates the potential of combining these methods for broader applications in other chemical domains.
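The idea of mapping fingerprint bit importances onto substructures can be illustrated with a minimal sketch. This is not the PIXIE implementation; it only assumes that each set fingerprint bit can be traced back to the atoms that produced it, and that the model supplies one signed importance per bit. The names `bit_to_atoms`, `bit_importance`, `atom_weights`, and `normalize`, as well as all numeric values, are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def atom_weights(bit_to_atoms, bit_importance):
    """Spread each set bit's importance equally over the atoms behind that bit."""
    weights = defaultdict(float)
    for bit, atoms in bit_to_atoms.items():
        if not atoms:
            continue  # a bit with no traced atoms contributes nothing
        share = bit_importance.get(bit, 0.0) / len(atoms)
        for atom in atoms:
            weights[atom] += share
    return dict(weights)


def normalize(weights):
    """Scale weights to [-1, 1] by the largest absolute value, ready for a colour map."""
    peak = max((abs(w) for w in weights.values()), default=0.0)
    if peak == 0.0:
        return {atom: 0.0 for atom in weights}
    return {atom: w / peak for atom, w in weights.items()}


# Hypothetical toy molecule with five atoms (indices 0-4).
# Assumption: bit 12 arises from atoms 0-2, bit 87 from atoms 2-3, bit 301 from atom 4.
bit_to_atoms = {12: [0, 1, 2], 87: [2, 3], 301: [4]}

# Assumption: signed per-bit contributions to the predicted yield (made-up values).
bit_importance = {12: 0.75, 87: -0.5, 301: 0.125}

heat = normalize(atom_weights(bit_to_atoms, bit_importance))
for atom in sorted(heat):
    print(atom, heat[atom])
```

In this sketch, atoms with positive values would be coloured as raising the predicted yield and atoms with negative values as lowering it. Where opposing bits overlap on the same atom, as on atom 2 here, their contributions cancel.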

Keywords

carbon dioxide
CO2
yield prediction
heat mapping
fingerprints
azoles
drug design
organic synthesis
molecular design
carboxylation
explainable AI

Supplementary materials

Supporting Information
Additional information regarding the principal component analysis (PCA), the evaluation of the dataset as well as additional figures on SHAP-based heat mapping and the assessment of isolated yields.

Results of Grid Search
Various combinations of regression models and molecular descriptors were evaluated to assess their performance in predicting yields.
