PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications

U. Deva Priyakumar; Divya B Korlepara; Vasavi C.S.; Rakesh Srivastava; Pradeep Kumar Pal; Saalim H. Raza; Vishal Kumar; Shivam Pandit; Aathira G. Nair; Sanjana Pandey; Shubham Sharma; Shruti Jeurkar; Kavita Thakran; Reena Jaglan; Shivangi Verma; Indhu Ramachandran; Prathit Chatterjee; Divya Nayar

doi:10.26434/chemrxiv-2023-mg07d

Theoretical and Computational Chemistry

07 August 2023, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Computing binding affinities is of great importance in the drug discovery pipeline, and predicting them with advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values, performing better than docking scores. This holds true even for a subset of ligands that follow Lipinski’s rule, and for diverse clusters of complex structures, thereby highlighting the importance of the PLAS-20k dataset in developing new ML models. In addition, our dataset is more useful than docking for classifying strong and weak binders. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets, together with their trajectories, will form a new synergy, paving the way for accelerating drug discovery.
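As a rough illustration of the two comparisons the abstract mentions (correlation with experiment, and separating strong from weak binders), the following minimal sketch may help. It is not the authors' evaluation code. The affinity values, the function names, and the -8.0 kcal/mol cutoff are all hypothetical and chosen only for demonstration. The one figure taken from the abstract is the ratio of simulations to complexes.

```python
import math

# From the abstract: 97,500 simulations over 19,500 complexes.
RUNS_PER_COMPLEX = 97_500 // 19_500  # 5 independent runs per complex


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def label_binders(affinities, threshold):
    """Label each affinity 'strong' if it is at or below the threshold.

    More negative binding affinities indicate tighter binding. The
    threshold is a hypothetical cutoff, not one taken from the paper.
    """
    return ["strong" if a <= threshold else "weak" for a in affinities]


# Hypothetical toy values in kcal/mol; NOT taken from PLAS-20k.
computed = [-12.1, -8.4, -15.3, -5.2, -9.9]
experimental = [-9.0, -7.1, -11.2, -5.5, -8.0]

r = pearson(computed, experimental)
labels = label_binders(experimental, threshold=-8.0)

print(f"runs per complex: {RUNS_PER_COMPLEX}")
print(f"Pearson r: {r:.3f}")
print(f"labels: {labels}")
```

With real data, the same correlation would be computed once for the MD-derived affinities and once for docking scores, so the two can be compared against the same experimental values.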

Keywords

computational drug discovery

MD Simulations

Machine Learning

Protein-Ligand affinity Dataset

Supplementary materials

Title

PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications

Description

Modern machine learning methods have become valuable tools for analyzing complex chemical and biological data, particularly in designing molecules with high binding affinity for specific protein targets. However, the success of these methods depends on reliable datasets, and limitations of commonly used datasets have recently been identified. To address this, the authors have released a dataset of 19,500 protein-ligand complexes and corresponding binding affinities derived from 97,500 molecular dynamics simulations. PLAS-20k is one of the most extensive and diverse datasets available for studying protein-ligand interactions. Its binding affinities are more accurate than those from typical docking calculations, suggesting its potential to improve scoring functions for on-the-fly calculation of protein-ligand complexes.


Supplementary weblinks

Title

PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations

Description

The PLAS-20k dataset was created primarily to meet the need for high-quality datasets that can support the development of advanced algorithms and drive progress in drug development. It comprises a diverse collection of 19,500 protein-ligand (PL) complexes and their corresponding binding affinities, derived from 97,500 independent molecular dynamics simulations.

Now Published

PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications

Divya B. Korlepara, Vasavi C. S., Rakesh Srivastava, Pradeep Kumar Pal, Saalim H. Raza, Vishal Kumar, Shivam Pandit, Aathira G. Nair, Sanjana Pandey, Shubham Sharma, Shruti Jeurkar, Kavita Thakran, Reena Jaglan, Shivangi Verma, Indhu Ramachandran, Prathit Chatterjee, Divya Nayar, U. Deva Priyakumar

Journal article

Scientific Data, Volume 11, Issue 1

Online publication date: Feb 09, 2024

Version History

Aug 07, 2023 Version 1

Metrics

Views: 1,336

Downloads: 562

License

The content is available under CC BY NC ND 4.0

DOI

10.26434/chemrxiv-2023-mg07d

Funding

Department of Science and Technology, India

DST/INSPIRE/04/2018/000455

DST-SERB

CRG/2021/008036

IHub-Data, IIIT Hyderabad

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
