Guidelines for Machine Learning Yield Prediction from Small-Size Literature Dataset

Jules Schleinitz; Maxime Langevin; Yanis Smail; Benjamin Wehnert; Laurence Grimaud; Rodolphe Vuilleumier

doi:10.26434/chemrxiv-2022-t6435-v2

18 May 2022, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Synthetic yield prediction using machine learning is intensively studied. Previous work has focused on two categories of datasets: High-Throughput Experimentation data, which serve as an ideal case study, and datasets extracted from proprietary databases, which are known to have a strong reporting bias towards high yields. However, predicting yields from published reaction data remains elusive. To fill this gap, we built a dataset of nickel-catalyzed cross-couplings extracted from organic reaction publications, including scope and optimization information. We demonstrate the importance of including optimization data as a source of failed experiments and emphasize how publication constraints shape the exploration of chemical space by the synthetic community. While machine learning models still fail to perform out-of-sample predictions, this work shows that adding chemical knowledge enables fair predictions in a low-data regime. Ultimately, we hope that this unique public database will foster further improvements of machine learning methods for reaction yield prediction in a more realistic context.

Keywords

Dataset

Machine Learning

Reaction Yield Prediction

Supplementary materials

Supplementary Information: Details on the code and the methods used to train the model and featurize the data, and additional information supporting the main manuscript.

Supplementary weblinks

NiCOlit code and data: The NiCOlit dataset and the code used to generate the results are available.

Now Published

Machine Learning Yield Prediction from NiCOlit, a Small-Size Literature Data Set of Nickel Catalyzed C–O Couplings

Jules Schleinitz, Maxime Langevin, Yanis Smail, Benjamin Wehnert, Laurence Grimaud, Rodolphe Vuilleumier

Journal article: Journal of the American Chemical Society, Volume 144, Issue 32

Online publication date: Aug 08, 2022

Version History

May 18, 2022 Version 2

Mar 25, 2022 Version 1

Version Notes

The manuscript was revised to make the outcomes more intelligible to the synthetic chemistry community. The dataset was analyzed in terms of coupling partner, substrate, and ligand categories. The reaction features responsible for the model's decisions were also highlighted and discussed.

Metrics

Views: 2,624

Downloads: 1,241

License

The content is available under CC BY 4.0

DOI

10.26434/chemrxiv-2022-t6435-v2

Funding

ANRT, grant 2019/0821

CNRS

ENS

Author’s competing interest statement

M.L. is a Sanofi employee and may hold shares and/or stock options in the company. J.S., B.W., Y.S., R.V., and L.G. declare that they have no competing interests.

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
