PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications

07 August 2023, Version 1

Abstract

Computing binding affinities is of great importance in the drug discovery pipeline, and predicting them with advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values and perform better than docking scores. This holds true even for a subset of ligands that follow Lipinski’s rule and for diverse clusters of complex structures, highlighting the importance of the PLAS-20k dataset in developing new ML models. In addition, our dataset is more useful than docking for classifying strong and weak binders. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets, along with trajectories, will form a new synergy, paving the way for accelerating drug discovery.
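As an illustration of the kind of evaluation the abstract describes, the following is a minimal sketch of how calculated affinities might be compared with experimental values and how strong and weak binders might be separated. It is not the authors' code: the function names, the threshold convention (more negative means stronger binding), and any input values are assumptions made for illustration. The only figures taken from the abstract are the 97,500 simulations and 19,500 complexes, which imply five independent simulations per complex.

```python
from math import sqrt

# Figures quoted in the abstract; the ratio is simple arithmetic.
N_SIMULATIONS = 97_500
N_COMPLEXES = 19_500
SIMS_PER_COMPLEX = N_SIMULATIONS // N_COMPLEXES  # 5


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def label_strong_binders(affinities, threshold):
    """Label each complex True (strong) if its affinity is at or below the threshold.

    Assumption: affinities are binding free energies, so more negative
    values mean stronger binding. The threshold is a hypothetical choice.
    """
    return [a <= threshold for a in affinities]


def agreement(predicted, reference):
    """Fraction of complexes on which two strong/weak labelings agree."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)
```

Given a list of calculated affinities and a matching list of experimental ones, `pearson_r(calculated, experimental)` gives the correlation, and `agreement(label_strong_binders(calculated, t), label_strong_binders(experimental, t))` gives the share of complexes classified consistently at a chosen threshold `t`. The same two calls with docking scores in place of calculated affinities would give a like-for-like comparison.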

Keywords

computational drug discovery
MD Simulations
Machine Learning
Protein-Ligand affinity Dataset

Supplementary materials

Title
PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications
Description
Modern machine learning methods have become valuable tools for analyzing complex chemical and biological data, particularly in designing molecules with high binding affinity for specific protein targets. However, the success of these methods depends on reliable datasets, and limitations of commonly used datasets have recently been identified. To address this, the authors have released a dataset with 19,500 protein-ligand complexes and corresponding binding affinities derived from 97,500 molecular dynamics simulations. PLAS-20k is one of the most extensive and diverse datasets available for studying protein-ligand interactions. Its binding affinities are more accurate than those from typical docking calculations, suggesting its potential to improve scoring functions for on-the-fly calculation of protein-ligand complexes.