Data Driven Estimation of Molecular Log-Likelihood using Fingerprint Key Counting

Esben Jannik Bjerrum

doi:10.26434/chemrxiv-2024-hzddj

Chemical similarity between two molecules is widely used in drug discovery and materials science: for similarity search, for toxicological assessment, and as a foundation for QSAR models. This study describes models that estimate the log-likelihood that a given molecule belongs to a specific dataset, which represents a form of similarity between a single molecule and a dataset. Two models are derived from simple counts of the fingerprint keys in the molecule and from statistics collected on the total number of observations in the dataset. The AtomLL model is shown to be useful for detecting outliers with unusual keys and shows the best baseline performance for class membership assignment. The MolLL model can detect outliers with an unusual number of repeats and also helps keep de novo molecular generation and optimization in scope. The performance of both models is compared to that of a kernel density estimator model based on molecular descriptors. The model code and some precomputed models are available as open source on GitHub.
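The counting idea described above can be illustrated with a minimal sketch. It is not the AtomLL or MolLL formulation, which this abstract does not specify. It assumes each molecule is already represented as a list of fingerprint keys (hypothetical integer IDs here), estimates key probabilities from dataset-wide counts with additive smoothing (an assumption), and scores a molecule by its mean per-key log-probability. The function names `fit_key_counts` and `log_likelihood` are illustrative only.

```python
import math
from collections import Counter


def fit_key_counts(dataset):
    """Count how often each fingerprint key occurs across the whole dataset."""
    counts = Counter()
    for keys in dataset:
        counts.update(keys)
    return counts


def log_likelihood(keys, counts, pseudocount=1.0):
    """Mean per-key log-probability of one molecule under the counted frequencies.

    Additive smoothing (an assumption of this sketch) keeps unseen keys
    from producing log(0).
    """
    if not keys:
        raise ValueError("molecule has no fingerprint keys")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen keys
    denom = total + pseudocount * vocab
    logps = [math.log((counts.get(k, 0) + pseudocount) / denom) for k in keys]
    return sum(logps) / len(logps)


# Toy dataset: each molecule is a list of hypothetical fingerprint key IDs.
dataset = [[1, 2, 2, 3], [1, 2, 4], [1, 3, 3]]
counts = fit_key_counts(dataset)

typical = log_likelihood([1, 2, 3], counts)    # only keys common in the dataset
unusual = log_likelihood([1, 99, 99], counts)  # contains keys never observed
print(typical > unusual)
```

In this toy setting a molecule made of frequently observed keys scores higher than one containing unseen keys, which is the behavior an outlier detector of this kind relies on. In practice the keys would come from a fingerprinting toolkit rather than hand-written lists.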
