Abstract
The SureChEMBL database provides open access to 17 million chemical entities mentioned in 14
million patents published since 1970. However, alongside molecules covered by patent
claims, the database contains many starting materials and intermediate products of little
pharmacological relevance. Herein, we introduce a new filtering protocol to automatically select
the core chemical structures best representing a congeneric series of pharmacologically relevant
molecules in patents. The protocol was first validated against a selection of 890 SureChEMBL patents
for which a total of 51,738 manually curated molecules are deposited in ChEMBL. Our protocol
recovered 92.5% of these ChEMBL molecules from among the 270,968 molecules that SureChEMBL lists
for those patents. Subsequently, the protocol was applied to all 240,988 US pharmacological
patents for which 9,111,706 molecules are available in SureChEMBL. The unsupervised filtering
process selected 5,949,214 molecules (65.3% of the total number of molecules) that form highly
congeneric chemical series in 188,795 of those patents (78.3% of the total number of patents).