Excuse me, there is a mutant in my bioactivity soup!
A comprehensive analysis of the genetic variability landscape of bioactivity databases and its effect on activity modelling

Marina Gorostiola González; Olivier J. M. Béquignon; Emma Manners; Anna Gaulton; Prudence Mutowo; Elisabeth Dawson; Barbara Zdrazil; Andrew R. Leach; Adriaan P. IJzerman; Laura H. Heitman; Gerard J. P. van Westen

doi:10.26434/chemrxiv-2024-kxlgm

Biological and Medicinal Chemistry

24 June 2024, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Bioactivity prediction is essential in computational drug discovery, particularly within virtual screening campaigns. Despite advancements in model architectures and features, the sparsity and quality of relevant training data remain a major bottleneck. Notably, genetic variance annotation, crucial for understanding variant-specific bioactivity, is often neglected. Key efforts to tackle these issues are conducted by public bioactivity databases such as ChEMBL, but these are not free of challenges. Here, a comprehensive analysis of the extent and distribution of bioactivity data tested on genetic variants, across organisms, protein families, individual targets, and specific variants, characterises in detail, for the first time, the genetic variability landscape of the ChEMBL database. It also sheds light on the range and consequences of protein amino acid substitutions in bioactivity data distribution and modelling. Furthermore, an extensive set of analysis resources (Python package and notebooks) and a variant-annotated bioactivity dataset are made available to help replicate the analyses described here for any protein of interest and to support informed decisions regarding the quality of data for modelling. Finally, the potential to extract variants and subsets of the chemical space with desirable inter-variant bioactivity profiles is demonstrated for data-rich proteins. This approach contributes to more reliable bioactivity modelling, aids noise reduction, and informs decision-making in computational drug discovery.

Supplementary materials

Supplementary Figures

Supplementary Tables

Metrics

Views: 816

Downloads: 405

License

The content is available under CC BY 4.0

Funding

Wellcome Trust, grant 104104/A/14/Z

Wellcome Trust, grant 18244/Z/19/Z

Wellcome Trust, grant 228142/Z/23/Z

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
