Abstract
We analyzed forty databases, ranging in size from a few thousand to nearly 100 million molecules and comprising over 210 million structures in total, for tautomeric conflicts. A tautomeric conflict is defined as the occurrence, within a data set, of two or more structures that the applied tautomeric rules identify as tautomers of each other. We tested a total of 119 detailed tautomeric transform rules expressed as SMIRKS, of which 79 yielded at least one conflict. These transformations cover three types of tautomerism: prototropic, ring-chain, and valence tautomerism. The databases analyzed span a wide variety of types, including large aggregating databases, drug collections, and structure collections based on experimental data. All databases analyzed showed intra-database tautomeric conflicts. The conflict rates, expressed as a percentage of database size, were typically in the range of a few tenths of a percent, which for the largest databases amounts to >100,000 cases per database.
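The conflict definition above can be illustrated with a minimal sketch. It is not the CACTVS-based workflow used in the paper: it assumes that each structure has already been assigned a hypothetical tautomer-invariant key (for example, a canonical form obtained after applying the transform rules), that the records contain no exact duplicate structures, and that the conflict rate is the number of conflict groups divided by the number of structures. The function names and the toy records are illustrative only.

```python
from collections import Counter, defaultdict


def find_conflicts(records):
    """Group structure IDs by tautomer key; keep groups with two or more members.

    `records` is an iterable of (structure_id, tautomer_key) pairs. The key is
    a hypothetical tautomer-invariant identifier computed elsewhere (assumption).
    """
    groups = defaultdict(set)
    for structure_id, tautomer_key in records:
        groups[tautomer_key].add(structure_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) >= 2}


def multiplet_histogram(conflicts):
    """Count conflicts by multiplet size (2 = pair, 3 = triplet, ...)."""
    return Counter(len(ids) for ids in conflicts.values())


def conflict_rate_percent(conflicts, n_structures):
    """Conflict groups as a percentage of database size (assumed definition)."""
    if n_structures == 0:
        return 0.0
    return 100.0 * len(conflicts) / n_structures


# Toy data: six structures, three distinct hypothetical tautomer keys.
records = [
    ("A1", "key1"), ("A2", "key1"),                  # one pair
    ("B1", "key2"),                                  # no conflict
    ("C1", "key3"), ("C2", "key3"), ("C3", "key3"),  # one triplet
]

conflicts = find_conflicts(records)
histogram = multiplet_histogram(conflicts)
rate = conflict_rate_percent(conflicts, n_structures=len(records))

print(conflicts)
print(dict(histogram))
print(f"{rate:.2f}%")
```

Grouping by a precomputed key keeps the sketch independent of any particular toolkit. In practice, the key would have to come from software that applies the tautomeric transforms, and the result depends on which rules are enabled.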
Supplementary materials
Transform File S1: The 120 transforms used in this paper, shown in SMIRKS format, plus the associated flags for their execution in the chemoinformatics toolkit CACTVS.
Table S2: Tautomeric conflict counts within each database for all multiplet sizes, up to 25.
Table S3: Conflict counts involving non-persistent stereocenters ("stereo conflicts") for all multiplet sizes, up to 25.