Abstract
The acid dissociation constant (pKa), which quantifies the propensity of a solute to donate a proton to its solvent, is crucial for drug design and synthesis, environmental fate studies, chemical manufacturing, and many other fields. Unfortunately, the terminology used to describe acid-base phenomena is inconsistent, creating substantial potential for misinterpretation. In this work, we examine a systematic confusion underlying the definition of “acidic” and “basic” pKa values for zwitterionic compounds. Because of this confusion, some pKa data are misrepresented in data repositories, including the widely used and highly trusted ChEMBL database. Such datasets commonly supply training data for pKa prediction models, so confusion and errors in the data degrade model performance. Herein, we discuss the intricacies of this issue. Given the high potential for confusion and the potentially high impact of describing acid-base phenomena accurately, we make suggestions for describing acid-base phenomena, training pKa prediction models, and stewarding pKa datasets.
Supplementary materials
Title
Data used in this study
Description
- iupac_chembl_overlap.csv: CSV file containing experimental pKa data with SMILES strings, along with the corresponding ChEMBL calculations
- iupac_chembl_qupkake_downsampled.csv: CSV file containing experimental pKa data with SMILES strings, along with both ChEMBL and QupKake calculations, for a smaller subset of the data
- glycine_solubility_needham.csv: solubility data