Abstract
Machine learning has the potential to provide tremendous value to the chemical
and material sciences by providing models that promise to save time, energy, and
starting material. Model training requires large amounts of clean, high-quality
data, and the methodology for transforming raw data to machine learning-ready
data should be robust, adaptable, and accessible. However, data is often cleaned
differently for different projects using proprietary code, making it difficult to compare
approaches and creating additional effort for other researchers who want to
work with literature-mined data. Herein, we present ORDerly, an open-source
Python package with a novel benchmark for reaction data and a highly customizable
pipeline for cleaning chemical reaction data stored in accordance with the
Open Reaction Database (ORD) schema. ORDerly contains standard cleaning
operations, such as frequency filtering and canonicalization checks, in addition to
chemically informed assignment of reaction roles using atom mapping, bespoke
name resolution, and reproducible open-source benchmark generation. We use
ORDerly to generate a machine learning-ready benchmark dataset for the prediction
of reaction conditions, and through extensive analysis, we find the aforementioned
cleaning steps to be essential to provide a high-quality dataset for machine learning.
In particular, we show that datasets missing key cleaning steps can lead to silently
inflated performance metrics. We then demonstrate that ORDerly can be used
in an end-to-end pipeline that goes from raw data to a reaction condition prediction
model in less than a day. With this customizable open-source solution for cleaning
and preparing chemical reaction data, ORDerly is poised to push forward the
boundaries of artificial intelligence applications in chemistry by providing a novel
benchmark for chemical reaction conditions and a data pipeline for researchers in
the chemical sciences to leverage large reaction datasets.
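To illustrate one of the standard cleaning operations named above, the following is a minimal sketch of frequency filtering written in plain Python. It does not use ORDerly's actual API: the function name `frequency_filter`, the dictionary layout of a reaction, and the `min_count` parameter are all hypothetical, and dropping a whole reaction when it contains a rare molecule is only one possible policy, assumed here for simplicity.

```python
from collections import Counter


def frequency_filter(reactions, field, min_count):
    """Drop reactions that contain a rare value in `field`.

    Hypothetical helper for illustration only, not part of ORDerly.
    A value is "rare" if it appears fewer than `min_count` times
    across all reactions in the dataset.
    """
    # Count how often each molecule string occurs in the chosen field.
    counts = Counter(value for rxn in reactions for value in rxn[field])
    # Keep a reaction only if every value in the field is frequent enough.
    return [
        rxn
        for rxn in reactions
        if all(counts[value] >= min_count for value in rxn[field])
    ]


# Toy dataset: SMILES strings stand in for the molecules of each reaction.
reactions = [
    {"product": "CCO", "solvents": ["O"]},
    {"product": "CCN", "solvents": ["O", "CO"]},
    {"product": "CCC", "solvents": ["CO"]},
    {"product": "CCCl", "solvents": ["ClCCl"]},
]

# "ClCCl" occurs only once, so the last reaction is removed.
filtered = frequency_filter(reactions, "solvents", min_count=2)
print([rxn["product"] for rxn in filtered])
```

Filtering on frequency keeps the label space of a condition prediction model small enough that every class has some training examples. The threshold is a design choice that trades dataset size against label coverage.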
Supplementary materials
Title
ORDerly: Supplementary Information
Description
A: ORDerly Datasheet
B: Dataset extraction and cleaning methodology
C: Further experimental details (training ML models)
D: Example reaction instances and predictions
E: ORDerly benchmark statistics