Unassisted Noise-Reduction of Chemical Reactions Data Sets

Alessandra Toniato; Philippe Schwaller; Antonio Cardinale; Joppe Geluykens; Teodoro Laino

doi:10.26434/chemrxiv.12395120.v2

Theoretical and Computational Chemistry

02 February 2021, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (> 90% for Natural Language Processing-based ones).

With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. Because human curation is prohibitively expensive, unaided approaches that remove chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the Pistachio collection of chemical reactions and to an open data set, both extracted from USPTO (United States Patent and Trademark Office) patents. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets. For the retrosynthetic models, the round-trip accuracy metric grows by 13 percentage points and the value of the cumulative Jensen-Shannon divergence decreases by 30% compared to its original value. The coverage remains high at 97%, and the value of the class-diversity is not affected by the cleaning. The proposed strategy is the first unassisted rule-free technique to address automatic noise reduction in chemical data sets.
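As an illustration of the divergence named in the abstract, the following is a minimal sketch of the standard Jensen-Shannon divergence between two discrete probability distributions. The abstract does not define the cumulative variant used in the paper, so this is not the authors' metric; the function names and the example distributions are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the result lies between 0 (identical
    distributions) and 1 (distributions with disjoint support).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical likelihood distributions over four reaction classes
# (illustrative numbers, not taken from the paper).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

print(jensen_shannon_divergence(uniform, uniform))
print(jensen_shannon_divergence(uniform, skewed))
```

A lower value means the two distributions are more alike, which is consistent with the abstract reporting a decrease in this quantity as an improvement. Note also that the 30% figure is a relative change, whereas the 13-point gain in round-trip accuracy is an absolute change in percentage points.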

Keywords

Artificial Intelligence research

Organic Syntheses

machine learning-based

data-driven research

noise-reduction

data curation strategy

Supplementary materials

Unassisted Noise-Reduction of Chemical Reaction Data Sets

Supporting Information
