Abstract
Density and refractive index (nD) are two important properties related to the van der Waals energy of a molecule. Accurate prediction of these two properties is therefore of great value both for molecular mechanics force field development and for predicting the solvation free energy and solubility of arbitrary molecules. In this study, we gathered molecular data for roughly 5,000 organic compounds with density records and 4,000 organic compounds with nD values. We then generated GAFF (General AMBER Force Field) descriptors and RDKit descriptors for the compounds and used each descriptor set to train prediction models for both properties with a variety of machine learning algorithms. Both GAFF and RDKit descriptors yielded robust models with low average percent errors (APE), low root-mean-square errors (RMSE), and high correlation coefficients (R-square), with RDKit descriptors performing slightly better for both properties. We further optimized the top models and conducted parallel feature analysis (PFA) to identify the specific features in each descriptor set that contributed most to model robustness. The final models achieve an RMSE of 0.071 g/cm³ for density and 0.014 for nD, an APE of 2.845% for density and 0.531% for nD, and an R-square of 0.950 for density and 0.954 for nD. Our models for both density and nD significantly outperform those of all currently published studies, especially studies whose datasets contain more than 200 records. The successful prediction of these two key molecular properties paves the way toward accurately predicting the solubility of an arbitrary solute in an arbitrary solvent, an endeavor that not only helps the pharmaceutical industry develop better drug candidates but also improves the overall efficiency of wet-lab work.
Key predictors that contribute most to a specific model or model function were identified using both Shapley analysis and correlation-group parallel feature analysis (CG-PFA).
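For readers who want to reproduce the three reported error metrics, the following is a minimal sketch, not the code used in this study. It assumes that APE is the mean of |predicted - observed| / |observed| expressed in percent, and that R-square is the squared Pearson correlation between observed and predicted values; the abstract does not state either definition explicitly. The function names and the sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the property."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def ape(observed, predicted):
    """Average percent error (assumed definition: mean |p - o| / |o| * 100)."""
    n = len(observed)
    return 100.0 * sum(abs(p - o) / abs(o) for o, p in zip(observed, predicted)) / n


def r_square(observed, predicted):
    """Squared Pearson correlation (assumed meaning of R-square here)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return (cov * cov) / (var_o * var_p)


# Hypothetical densities in g/cm^3, for illustration only (not data from this study).
observed_density = [0.79, 0.87, 1.00, 1.26, 1.49]
predicted_density = [0.81, 0.85, 1.04, 1.20, 1.52]

print("RMSE:", rmse(observed_density, predicted_density))
print("APE (%):", ape(observed_density, predicted_density))
print("R-square:", r_square(observed_density, predicted_density))
```

Because APE is a relative measure and RMSE an absolute one, the two need not rank models identically, which is one reason to report both alongside R-square.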