Abstract
Conventional modeling in cheminformatics has so far relied on supervised and unsupervised machine learning (ML) and deep learning models built on standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a compound. Departing from this approach, in this investigation we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR) approach, which combines the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have applied different classification modeling algorithms to the selected QSAR and RASAR descriptors to develop models for predicting the hepatotoxicity of query compounds. The predictivity of each of these models was evaluated on a large number of test set compounds. Additionally, the best-performing model was used to screen a true external set of data. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. Unsupervised dimensionality reduction techniques such as t-SNE and UMAP, together with the supervised ARKA framework, showed that the RASAR descriptors are more useful than the selected QSAR descriptors for grouping similar compounds, enhancing the modelability of the dataset, and efficiently identifying activity cliffs.
Furthermore, activity cliffs were also identified from Read-Across by examining the nature of the compounds constituting the nearest neighbors of a particular query compound. Compared with the previously reported models developed using the same dataset derived from the US FDA Orange Book (https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm), our linear c-RASAR model, based on linear discriminant analysis (LDA), is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set surpasses that of the previously reported work. Therefore, the present LDA c-RASAR model can be used efficiently to predict the hepatotoxicity of query chemicals.
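To illustrate how similarity-based information from the close source neighbors of a query compound can be turned into descriptors, a minimal sketch follows. It is not the authors' tool and does not reproduce the 18 RASAR descriptors listed in Table S1. The Tanimoto similarity on sets of fingerprint bits, the neighbor count `k`, and the descriptor names (`weighted_pos_fraction`, `max_sim_pos`, `max_sim_neg`, `mean_sim`) are assumptions chosen only for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_descriptors(query, sources, k=3):
    """Compute illustrative (hypothetical) similarity-based descriptors for one query.

    sources: list of (fingerprint_set, label) pairs, label 1 = toxic, 0 = non-toxic.
    Only the k most similar source compounds contribute to the descriptors.
    """
    scored = sorted(
        ((tanimoto(query, fp), label) for fp, label in sources),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbors = scored[:k]
    total = sum(sim for sim, _ in neighbors)
    pos_sims = [sim for sim, label in neighbors if label == 1]
    neg_sims = [sim for sim, label in neighbors if label == 0]
    return {
        # Similarity-weighted share of toxic compounds among the close neighbors
        "weighted_pos_fraction": sum(pos_sims) / total if total > 0 else 0.0,
        # Highest similarity to a toxic / non-toxic close neighbor
        "max_sim_pos": max(pos_sims, default=0.0),
        "max_sim_neg": max(neg_sims, default=0.0),
        # Mean similarity to the close neighbors
        "mean_sim": total / len(neighbors) if neighbors else 0.0,
    }


# Toy data (made up): four source compounds and one query compound
sources = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({7, 8, 9}, 0),
    ({1, 2, 9}, 0),
]
query = {1, 2, 3}
desc = similarity_descriptors(query, sources, k=3)
print(desc)
```

Descriptors of this kind could then be passed to any classifier in place of, or alongside, conventional QSAR descriptors. A query whose close neighbors mostly carry the opposite class label would be a candidate activity cliff under this simplified view.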
Supplementary materials
Title
Supplementary Materials SI-1 and SI-2
Description
Supplementary Material-1 (SI-1) contains the original data set used for modeling with the values of structural and physicochemical descriptors and also selected RASAR descriptors.
Supplementary Material-2 (SI-2) contains the following Tables and Figures.
Table S1. List of the 18 different RASAR descriptors and their significance
Table S2. List of QSAR descriptors used for modeling
Table S3. Optimized hyperparameter settings for the ML-based QSAR and c-RASAR models
Table S4. Comparison of test set prediction performance of c-RASAR vs. QSAR models
Table S5. LDA c-RASAR model coefficients
Figure S1. Most discriminating QSAR descriptors.
Figure S2. Heat maps showing the variation in the selected RASAR descriptors for the first and last 20 compounds of the training and test sets.
Figure S3. Chord diagram representing the contribution of sm1 descriptors towards positive and negative activity values.
Supplementary weblinks
Title
DTC Lab software tools Supplementary Site
Description
The RASAR descriptor computation tools are available from this site.