Extracting and Comparing PFAS from Literature and Patent Documents using Open Access Chemistry Toolkits

Shadrack Barnabas; Timo Böhme; Stephen Boyer; Matthias Irmer; Christoph Ruttkies; Ian Wetherbee; Todor Kondic; Emma Schymanski; Lutz Weber

doi:10.26434/chemrxiv-2022-nmnnd-v2

Theoretical and Computational Chemistry

01 April 2022, Version 2

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The extraction of chemical information from documents is a demanding task in cheminformatics due to the variety of text and image-based representations of chemistry. The present work describes the extraction of chemical compounds with unique chemical structures from the open access CORE (COnnecting REpositories) and Google Patents full-text document repositories. The importance of structure normalization is demonstrated using three open access cheminformatics toolkits: CDK, RDKit and OpenChemLib. Each toolkit was used for structure parsing, normalization and subsequent substructure searching, with SMILES as the structure representation. Per- and polyfluoroalkyl substances (PFAS) were chosen as a case study for the substructure search because of their high environmental relevance, their presence in both literature and patent corpora, and the current lack of community consensus on their definition. Three different structural definitions of PFAS were chosen to highlight the implications of each definition from a cheminformatics perspective. Since CDK, RDKit and OpenChemLib implement different criteria and methods for SMILES parsing and normalization, each toolkit yielded a different number of parsed compounds, which were then evaluated using the three PFAS definitions. A comparison of these toolkits and definitions is provided, along with a discussion of the implications for PFAS screening and text mining efforts in cheminformatics. Finally, the extracted PFAS (~1.7 M PFAS from patents and ~27K from CORE) were compared against existing PFAS lists and are provided in several formats for further community research efforts.
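The toolkit comparison described in the abstract can be pictured as a set comparison over the structures each toolkit flags. The following is a minimal sketch only, not the authors' workflow: it assumes each toolkit's hits have already been reduced to a toolkit-independent structure key (for example an InChIKey), and the function name `compare_toolkits` and the keys `K1` to `K5` are hypothetical placeholders rather than data from the paper.

```python
from itertools import combinations


def compare_toolkits(hits):
    """Summarise agreement between toolkits.

    `hits` maps a toolkit name to the set of structure keys that the
    toolkit parsed and flagged under one PFAS definition (assumed input).
    """
    all_sets = list(hits.values())
    consensus = set.intersection(*all_sets)  # flagged by every toolkit
    union = set().union(*all_sets)           # flagged by at least one toolkit
    unique = {}
    for name, keys in hits.items():
        # Keys that no other toolkit reported.
        others = set().union(*(k for n, k in hits.items() if n != name))
        unique[name] = keys - others
    # Size of the overlap for every pair of toolkits.
    pairwise = {
        (a, b): len(hits[a] & hits[b])
        for a, b in combinations(sorted(hits), 2)
    }
    return {
        "consensus": consensus,
        "union": union,
        "unique": unique,
        "pairwise": pairwise,
    }


# Hypothetical placeholder keys, not results from the paper.
hits = {
    "CDK": {"K1", "K2", "K3"},
    "RDKit": {"K1", "K2", "K4"},
    "OpenChemLib": {"K1", "K3", "K4", "K5"},
}
report = compare_toolkits(hits)
print(len(report["consensus"]), "structures flagged by all toolkits")
print(len(report["union"]), "structures flagged by at least one toolkit")
```

Running the same comparison once per PFAS definition would give one such report per definition, which is one way to see how the choice of toolkit and the choice of definition each shift the hit counts.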

Keywords

patents

literature mining

per- and polyfluoroalkyl substances (PFAS)

cheminformatics

non-target screening

open science

Version History

May 19, 2022 Version 3

Apr 01, 2022 Version 2

Mar 21, 2022 Version 1

Version Notes

We have added Ian Wetherbee as a co-author (he was listed in the acknowledgements in the previous version, as mutually agreed, due to delays in internal review).

Metrics

Views: 3,352

Downloads: 1,760

License

The content is available under CC BY 4.0

Funding

Luxembourg National Research Fund (FNR): A18/BM/12341006

European Commission: 101036756

Author’s competing interest statement

SJB, TB, SB, MI, CR, IW, LW declare that they are involved in the creation of the commercial products OC|processor and Google Patents discussed in this paper.

Ethics

The author(s) have declared that ethics committee/IRB approval is not relevant to this content.
