Probabilistic Random Forest Improves Bioactivity Predictions Close to the Classification Threshold by Taking into Account Experimental Uncertainty

Lewis Mervin; Maria-Anna Trapotsi; Avid M. Afzal; Ian Barrett; Andreas Bender; Ola Engkvist

doi:10.26434/chemrxiv.14544291.v1

Theoretical and Computational Chemistry

07 May 2021, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

In small-molecule property prediction, experimental error is usually neglected during model generation. The main caveat of binary classification approaches is that they weight minority cases close to the threshold boundary equally when distinguishing between activity classes. For example, with a classification threshold of 5, pXC50 values of 5.1 and 4.9 are treated as equally important contributions to opposing activity classes, even though experimental error may not afford such discriminatory accuracy. This is detrimental in practice, so it is important both to evaluate the presence of experimental error in databases and to apply methodologies that account for experimental variability and for uncertainty near the decision boundary.

To improve on this, we present a novel approach to predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF is a modification of the long-established Random Forest (RF) that takes into account uncertainties in the assigned classes (i.e., activity labels). This allows activity to be represented in a framework in between the classification and regression architectures, with philosophical differences from either approach. Compared to classification, this approach enables better representation of factors increasing/decreasing inactivity. Compared to a classical regression framework, it can use all data (even delimited/operand/censored data far from a cut-off) while still taking into account the granularity around the cut-off. The algorithm was applied to ~550 target prediction tasks from ChEMBL and PubChem. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way by the original RF algorithm. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold). The RF models gave errors smaller than the experimental uncertainty, which could indicate that they are overtrained and/or over-confident. Overall, we show that PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold. With this approach we present what is, to our knowledge, the first application of probabilistic modelling of activity data for target prediction using the PRF algorithm.
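To make the idea of uncertain activity labels concrete, the following is a minimal sketch of one plausible way to turn a pXC50 value into a probabilistic label using a Gaussian cumulative distribution function, as suggested by the CDF keyword. The abstract does not give the exact formula, so the function names, the Gaussian error model, and the standard deviation of 0.3 log units are illustrative assumptions and not the authors' confirmed implementation.

```python
import math


def hard_label(pxc50: float, threshold: float = 5.0) -> int:
    """Conventional binary label: 1 (active) at or above the threshold, else 0."""
    return int(pxc50 >= threshold)


def activity_probability(pxc50: float, threshold: float = 5.0, sd: float = 0.3) -> float:
    """Probability that the true activity is at or above the threshold.

    Assumption (not from the source): the measurement error is Gaussian with
    standard deviation `sd` (in pXC50 log units) around the measured value.
    Under that assumption the probability is the normal CDF evaluated at
    (pxc50 - threshold) / sd.
    """
    if sd <= 0:
        raise ValueError("sd must be positive")
    z = (pxc50 - threshold) / sd
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Compare hard and probabilistic labels for values near and far from the cut-off.
    for value in (4.0, 4.9, 5.0, 5.1, 6.0):
        print(f"pXC50={value:.1f}  hard={hard_label(value)}  "
              f"p_active={activity_probability(value):.3f}")
```

Under these assumptions, values of 4.9 and 5.1 receive opposite hard labels but probabilistic labels that both sit near 0.5, while values far from the threshold approach 0 or 1. Labels of this kind are what a classifier that accepts label uncertainty, such as the PRF, could consume in place of hard classes.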

Keywords

Probabilistic Random Forest

Cumulative Distribution Function (CDF)

Uncertainty Estimation

Target prediction

QSAR Modeling

Experimental Error

Supplementary materials

Mervin Manuscript

Now Published

Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty

Lewis H. Mervin, Maria-Anna Trapotsi, Avid M. Afzal, Ian P. Barrett, Andreas Bender, Ola Engkvist

Journal article, Journal of Cheminformatics, Volume 13, Issue 1

Online publication date: Aug 19, 2021

License

The content is available under CC BY NC ND 4.0

Funding

Biotechnology and Biological Sciences Research Council

BB/M011194/1

Cambridge Doctoral Training Partnership - 2

https://app.dimensions.ai/details/grant/grant.3956483

Author’s competing interest statement

No conflict of interest
