Abstract
Chlorinated compounds are generally known to be not readily biodegradable. Insight into the structural features that allow chlorinated compounds to biodegrade readily is therefore crucial and remains to be established. Combining in silico modeling with machine learning to predict desirable compound properties has proven to be an effective approach, enabling chemists to save time and resources compared with wet-lab experimentation. Here we present two machine learning-based quantitative structure-biodegradability relationship (QSBR) models: one for predicting biodegradability values of chlorinated compounds, and the other for classifying chlorinated compounds as biodegradable or non-biodegradable. The regression models were generated using the Support Vector Regression (SVR) machine learning method. The optimal regression model was a 10-descriptor SVR model with R² = 0.925 and test-set R² = 0.881. The optimal classification model was a logistic regression classifier with 5 descriptors. It has a Matthews correlation coefficient (MCC) of 0.59 for the training set and 0.55 for the test set, and an accuracy of 0.79 for the training set and 0.77 for the test set. For validation purposes, the models were tested on an external data set of chlorinated compounds. In addition, the models were applied to an external test set of monomeric units representing polymers to assess their capability to estimate the biodegradability of polymers, where the models showed statistical robustness. The developed SVR model could be used for accurate prediction of the biodegradability of various organic molecules, as well as of materials based on organic compounds. The influence of the descriptors on biodegradability is also analyzed. The classification model showed that the biodegradability of chlorinated compounds is strongly correlated with electrotopological descriptors and with the position of oxygen relative to chlorine.
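For readers less familiar with the statistics quoted above, the following minimal sketch shows how R², accuracy, and the Matthews correlation coefficient are commonly computed. It assumes that R² denotes the coefficient of determination and that class labels are coded as 1 (biodegradable) and 0 (non-biodegradable); the function names and the toy labels are illustrative only and are not taken from the study.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition of R²)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of compounds assigned to the correct class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    observed = [1, 1, 1, 0, 0, 0, 1, 0]
    predicted = [1, 1, 0, 0, 0, 1, 1, 0]
    print("accuracy:", accuracy(observed, predicted))
    print("MCC:", matthews_corrcoef(observed, predicted))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the two classes may be unevenly represented.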