Abstract
Bioactivity prediction is essential in computational drug discovery, particularly in virtual screening campaigns. Despite advances in model architectures and features, the sparsity and quality of the relevant training data remain a major bottleneck. Notably, the annotation of genetic variants, which is crucial for understanding variant-specific bioactivity, is often neglected. Public bioactivity databases such as ChEMBL lead key efforts to tackle these issues, but they are not free of challenges. Here, a comprehensive analysis is presented of the extent and distribution of bioactivity data tested on genetic variants across organisms, protein families, individual targets, and specific variants. This analysis characterises in detail, for the first time, the genetic variability landscape in the ChEMBL database and sheds light on the range of protein amino acid substitutions and their consequences for bioactivity data distribution and modelling. Furthermore, an extensive set of analysis resources (a Python package and notebooks) and a variant-annotated bioactivity dataset are made available, so that the analyses described here can be replicated for any protein of interest and informed decisions can be made about the quality of data for modelling. Finally, the potential to extract variants and subsets of the chemical space with desirable inter-variant bioactivity profiles is demonstrated for data-rich proteins. This approach contributes to more reliable bioactivity modelling, aids noise reduction, and informs decision-making in computational drug discovery.
Supplementary materials
Supplementary Figures
Supplementary Tables