A Machine-learning-based Data Analysis Method for Cell-based Selection of DNA-encoded libraries (DELs)

Rui Hou; Chao Xie; Yuhan Gui; Gang Li; Xiaoyu Li

doi:10.26434/chemrxiv-2022-hg2x8

30 September 2022, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

DNA-encoded library (DEL) technology is a powerful ligand discovery platform that has been widely adopted in the pharmaceutical industry. DEL selections are typically performed with a purified protein target, either immobilized on a matrix or in solution. Recently, DELs have also been used to interrogate targets in complex biological environments, such as membrane proteins on live cells. However, because of the complex landscape of the cell surface, such selections inevitably involve significant non-specific interactions, and the resulting data are much noisier than data from selections against purified proteins, which makes reliable hit identification highly challenging. Several approaches have been developed to denoise DEL datasets, but it remains unclear whether they are suitable for cell-based DEL selections. Here, we propose a new machine-learning (ML)-based approach for processing cell-based DEL selection datasets that uses a Maximum A Posteriori (MAP) estimation loss function, a probabilistic framework that can account for and quantify the uncertainties of noisy data. We applied the approach to a DEL selection dataset in which a library of 7,721,415 compounds was selected against purified carbonic anhydrase 2 (CA-2) and against a cell line expressing the membrane protein carbonic anhydrase 12 (CA-12). An Extended-Connectivity Fingerprint (ECFP)-based regression model trained with the MAP loss function identified true binders and reliable structure-activity relationships (SAR) from the noisy cell-based selection datasets. In addition, the regularized enrichment metric (termed MAP enrichment) can be calculated directly, without any specific machine-learning model, which effectively suppresses low-confidence outliers and enhances the signal-to-noise ratio.
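
The abstract does not state the formula behind MAP enrichment, so the following is only a minimal sketch of how a regularized enrichment of this kind might be computed. It assumes (our assumption, not the preprint's stated model) that sequencing counts are Poisson-distributed with a Gamma(alpha, beta) prior on the rate, so that the posterior mode has a closed form. The function names `map_poisson_rate` and `map_enrichment`, the `exposure` argument (expected reads per compound), and the default prior parameters are all hypothetical.

```python
"""Sketch of a MAP-regularized enrichment score for DEL sequencing counts.

ASSUMPTION: counts ~ Poisson(rate * exposure) with a Gamma(alpha, beta) prior
on the rate. This is an illustration, not the formula used in the preprint.
"""


def map_poisson_rate(count: int, exposure: float,
                     alpha: float = 2.0, beta: float = 1.0) -> float:
    """Posterior mode of the Poisson rate under a Gamma(alpha, beta) prior.

    The posterior is Gamma(alpha + count, beta + exposure); its mode is
    (alpha + count - 1) / (beta + exposure), which is positive when alpha > 1.
    """
    if count < 0 or exposure <= 0:
        raise ValueError("count must be >= 0 and exposure must be > 0")
    if alpha <= 1.0 or beta <= 0.0:
        raise ValueError("this sketch requires alpha > 1 and beta > 0")
    return (count + alpha - 1.0) / (exposure + beta)


def map_enrichment(target_count: int, control_count: int,
                   target_exposure: float, control_exposure: float,
                   alpha: float = 2.0, beta: float = 1.0) -> float:
    """Ratio of MAP rates in the target selection versus the control."""
    target_rate = map_poisson_rate(target_count, target_exposure, alpha, beta)
    control_rate = map_poisson_rate(control_count, control_exposure, alpha, beta)
    return target_rate / control_rate


if __name__ == "__main__":
    # Hypothetical counts: (label, target reads, control reads).
    examples = [("low-count outlier", 2, 0), ("well-sampled compound", 200, 20)]
    for label, target, control in examples:
        score = map_enrichment(target, control, 1.0, 1.0)
        print(f"{label}: MAP enrichment = {score:.2f}")
```

Under these assumptions, a compound with 2 target reads and 0 control reads receives a finite, modest score rather than the undefined value a naive count ratio would give, while a compound supported by many reads keeps a score close to its raw ratio. This is the qualitative behavior the abstract attributes to MAP enrichment: low-confidence outliers are suppressed relative to well-sampled compounds.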

Keywords

DNA-encoded library

machine learning

high throughput screening

combinatorial library

Supplementary materials

Supporting Information: Supplementary figures, tables, and experimental methods.

License

The content is available under CC BY NC ND 4.0

Funding

Shenzhen Bay Laboratory: SZBL2020090501008

Research Grants Council of Hong Kong SAR: AoE/P-705/16, 17301118, 17111319, 17303220, 17300321, and C7005-20G

NSFC of China: 21877093 and 91953119

Innovation and Technology Commission: "Laboratory for Synthetic Chemistry and Chemical Biology" under the Health@InnoHK Program

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
