Abstract
Computing binding affinities is of great importance in the drug discovery pipeline, and predicting them with advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values, performing better than docking scores. This holds true even for a subset of ligands that follow Lipinski’s rule, and for diverse clusters of complex structures, thereby highlighting the importance of the PLAS-20k dataset in developing new ML models. In addition, our dataset is more useful than docking for classifying strong and weak binders. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets, along with their trajectories, will form a new synergy, paving the way for accelerating drug discovery.
Supplementary materials
Title
PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications
Description
Modern machine learning methods have become valuable tools for analyzing complex chemical and biological data, particularly in designing molecules with high binding affinity for specific protein targets. However, the success of these methods relies on reliable datasets, and limitations of commonly used datasets have recently been identified. To address this, the authors have released a dataset with 19,500 protein-ligand complexes and corresponding binding affinities derived from 97,500 molecular dynamics simulations. PLAS-20k is one of the most extensive and diverse datasets available for studying protein-ligand interactions. Its binding affinities are more accurate than those from typical docking calculations, suggesting its potential to improve scoring functions for on-the-fly calculation of protein-ligand complexes.
Supplementary weblinks
Title
PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations
Description
The creation of the PLAS-20k dataset was primarily motivated by the need for high-quality datasets that can support the development of advanced algorithms and drive significant advances in drug development. The dataset comprises a diverse collection of 19,500 protein-ligand (PL) complexes and their corresponding binding affinities, derived from 97,500 independent molecular dynamics simulations.