Abstract
This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the
crucial need for high-quality chemical reaction datasets in the realm of machine learning
applications in organic chemistry. Recent advances in artificial intelligence have expanded the
application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and
reaction condition prediction. However, the effectiveness of these models hinges on the integrity
of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants,
incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage
approach to refine these datasets. The first stage involves extracting meaningful reaction
transformation rules and formulating generic reaction templates using a simplified SMARTS
representation. This simplification broadens the applicability of templates across various chemical
reactions. The second stage is template-guided reaction verification, where these templates
are systematically applied to validate and correct the reaction data. This process effectively
amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect
data entries. A standout feature of AutoTemplate is its capability to concurrently identify and
correct false chemical reactions. It operates on the premise that most reactions in datasets are
accurate, using templates derived from them to guide the correction of flawed entries. The protocol demonstrates
its efficacy across a range of chemical reactions, significantly enhancing dataset quality.
This advancement provides a more robust foundation for developing reliable machine learning
models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions.
AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets,
bridging a vital gap and facilitating more precise and efficient machine learning applications in
organic synthesis.

Scientific contribution: The proposed automated preprocessing tool for chemical
reaction data aims to identify errors within chemical databases. Specifically, when the errors
involve atom mapping or missing reactant types, corrections can be systematically applied
using reaction templates, improving the overall quality of the database.
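
The premise that most reactions in a dataset are accurate can be illustrated with a minimal sketch. It is not the AutoTemplate implementation: it assumes each reaction already carries a pre-extracted template label (the labels, the function name `split_by_template_support`, and the `min_support` threshold are all hypothetical), and it only separates reactions whose template is well supported from those that would need template-guided verification.

```python
from collections import Counter


def split_by_template_support(reactions, min_support=2):
    """Partition reaction ids into (trusted, suspect) lists.

    A reaction is treated as trusted when its template label occurs at
    least `min_support` times in the dataset. This is an illustrative
    stand-in for the majority premise, not the paper's actual criterion.
    """
    counts = Counter(template for _, template in reactions)
    trusted, suspect = [], []
    for rxn_id, template in reactions:
        if counts[template] >= min_support:
            trusted.append(rxn_id)
        else:
            suspect.append(rxn_id)
    return trusted, suspect


# Toy data: (reaction id, template label). Labels are placeholders,
# not real simplified-SMARTS templates.
reactions = [
    ("r1", "amide_coupling"),
    ("r2", "amide_coupling"),
    ("r3", "amide_coupling"),
    ("r4", "suzuki_coupling"),
    ("r5", "suzuki_coupling"),
    ("r6", "unmatched_pattern"),
]

trusted, suspect = split_by_template_support(reactions)
print("trusted:", trusted)
print("suspect:", suspect)
```

In a real pipeline, the suspect entries would then be checked against the well-supported templates to repair missing reactants or atom mappings, or discarded if no template applies.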