Abstract
Predicting the pKa values of small molecules has key applications in drug discovery and molecular simulation. However, current methods struggle to interpret experimental data rigorously and to ensure thermodynamic consistency between successive pKa values. This study proposes a protonation ensemble framework that addresses these limitations by modeling the full space of possible protonation microstates. Within this framework, we derive rigorous definitions connecting experimental macro-pKas to the underlying micro-pKa equilibria, and we develop Uni-pKa, an accurate and reliable pKa predictor. Uni-pKa is first pretrained on over 1 million predicted pKas from ChEMBL to learn expressive molecular representations. It is then finetuned on high-quality experimental pKa datasets, which are fitted to the framework by recovering the underlying microstates from the measured macro-pKas, so that training is consistent with the protonation ensemble definitions. Modeling the complete ensemble enables rigorous interpretation of macro-pKa data and inherently preserves thermodynamic consistency, which improves the prediction accuracy of Uni-pKa. Experiments demonstrate that Uni-pKa achieves state-of-the-art performance, outperforming previous methods. The protonation ensemble approach advances machine learning for pKa prediction and molecular property modeling, and Uni-pKa illustrates how chemical knowledge can be combined with machine learning methods. Users can apply Uni-pKa to predict and rank the protonation states of molecules under various pH conditions via https://app.bohrium.dp.tech/uni-pka.
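To make the relation between macro-pKas and microstate equilibria concrete, the following is a minimal sketch based on standard ensemble thermodynamics. It is not the Uni-pKa implementation. It assumes that each microstate is assigned a dimensionless free energy (in units of RT, referenced to pH 0); the function names `macro_pka` and `microstate_populations` and the example energies are hypothetical and chosen only for illustration.

```python
import math

LN10 = math.log(10.0)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def macro_pka(acid_energies, base_energies):
    """Macro-pKa between two adjacent macrostates.

    Assumption: energies are dimensionless (units of RT) at pH 0.
    acid_energies: microstates carrying n protons.
    base_energies: microstates carrying n - 1 protons.
    """
    log_z_acid = log_sum_exp([-g for g in acid_energies])
    log_z_base = log_sum_exp([-g for g in base_energies])
    return (log_z_acid - log_z_base) / LN10


def microstate_populations(energies, proton_counts, ph):
    """Fractional population of every microstate at a given pH.

    Each proton bound contributes a factor 10**(-pH) to the weight.
    """
    log_w = [-g - ph * LN10 * n for g, n in zip(energies, proton_counts)]
    log_z = log_sum_exp(log_w)
    return [math.exp(lw - log_z) for lw in log_w]


# Illustrative case: one protonated microstate that can lose its proton
# along two routes with micro-pKas of 4 and 5 (made-up values).
acid = [0.0]
base = [4.0 * LN10, 5.0 * LN10]

pka = macro_pka(acid, base)
pops = microstate_populations(acid + base, [1, 0, 0], ph=pka)
```

Because every micro-pKa in this sketch is a difference of per-microstate free energies, any closed cycle of protonation steps sums to zero by construction, which is one way to see why an ensemble description preserves thermodynamic consistency. At a pH equal to the macro-pKa, the protonated macrostate holds half of the total population.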