Leveraging High-throughput Molecular Simulations and Machine Learning for Formulation Design

Alex K. Chew; Mohammad Atif Faiz Afzal; Zach Kaplan; Eric M. Collins; Suraj Gattani; Mayank Misra; Anand Chandrasekaran; Karl Leswing; Mathew D. Halls

doi:10.26434/chemrxiv-2024-4lff6-v2

Materials Science

28 October 2024, Version 2

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Formulations, or mixtures of chemical ingredients, are ubiquitous across materials science applications, such as thermoplastics, consumer packaged goods, and energy storage devices. However, finding formulations with optimal properties is difficult because of the non-obvious connection between individual ingredient structures and compositions and the downstream mixture properties. Computational approaches that can traverse the expansive design space offer a promising route to finding formulations with improved properties while minimizing the number of experiments. In this work, we generated a large formulation dataset using high-throughput classical molecular dynamics simulations, resulting in more than 30,000 solvent mixtures ranging from pure-component to five-component systems. We developed three formulation-property relationship approaches to create machine learning models that use the ingredient structure and composition as input to predict a formulation property: formulation descriptor aggregation (FDA), formulation descriptor Set2Set (FDS2S), and formulation graph (FG). We found that FDS2S, a new approach that uses a Set2Set layer to aggregate molecular descriptors of individual ingredients, outperforms all other approaches in accurately predicting density, heat of vaporization, and enthalpy of mixing computed from molecular simulations. Feature importance analysis of FDA models reveals that specific substructures are important for predicting these formulation properties, which is useful in designing formulations to achieve target properties. When leveraging an active learning framework to iteratively suggest the next ingredient and composition to experiment on, we found that formulation-property relationships can identify formulations with the highest property values at least two to three times faster than random guessing, demonstrating that machine learning models can provide valuable insight for suggesting the next experiment even when starting from a limited dataset of ~100 examples. Finally, we found that formulation-property models can be applied to experimental datasets from the literature to accurately predict binary liquid viscosities, drug solubility in single or binary mixtures, and motor octane number in hydrocarbon mixtures. Our research demonstrates the utility of high-throughput simulations and machine learning algorithms applied to designing formulations with promising properties, which could broadly accelerate the design of new materials for a wide range of applications, such as improving the performance of liquid electrolytes for batteries, fuel mixtures for oil and gas, solvent additives for perfumes or paints, and more.
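The descriptor-aggregation idea named in the abstract (FDA) can be illustrated with a minimal sketch. The abstract does not state which descriptors or aggregation operators the authors use, so everything below is an assumption made for illustration: the function name `aggregate_formulation`, the toy descriptor values, and the choice of a composition-weighted mean and standard deviation. The point is only to show how a variable number of ingredients can be mapped to a fixed-length, order-independent input vector for a standard regression model.

```python
import numpy as np


def aggregate_formulation(descriptors, fractions):
    """Composition-weighted mean and std of ingredient descriptors.

    Hypothetical helper, not the authors' implementation.

    descriptors: (n_ingredients, n_features) array, one row per ingredient.
    fractions:   (n_ingredients,) compositions; rescaled here to sum to 1.
    Returns a vector of length 2 * n_features (means, then stds).
    """
    X = np.asarray(descriptors, dtype=float)
    w = np.asarray(fractions, dtype=float)
    if X.ndim != 2 or w.shape != (X.shape[0],):
        raise ValueError("expected one fraction per descriptor row")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("fractions must be non-negative and not all zero")
    w = w / w.sum()                # normalize the composition
    mean = w @ X                   # weighted mean of each feature
    var = w @ (X - mean) ** 2      # weighted variance of each feature
    return np.concatenate([mean, np.sqrt(var)])


# Toy descriptor values (assumed, not taken from the paper).
binary = aggregate_formulation([[1.0, 0.0], [3.0, 2.0]], [0.5, 0.5])
pure = aggregate_formulation([[1.0, 0.0]], [1.0])
```

Because the weighted sums do not depend on row order, swapping the two ingredients of `binary` gives the same vector, and a pure component reduces to its own descriptors with zero spread. A learned aggregator such as a Set2Set layer replaces these fixed statistics with a trainable, still order-independent, pooling step.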

Keywords

Formulations

Chemical Mixtures

Classical Molecular Dynamics Simulations

Formulation-Property Relationships

Quantitative Structure-Property Relationships

Machine Learning

Supplementary materials

Supplementary information document

The supporting information contains the comparison of formulation labels between molecular dynamics simulations and experiments, analysis of miscibility for binary mixtures using molecular dynamics simulations, best hyperparameters of formulation-property models when trained with 90% of the data, and description of the formulation dataset generated in this work and the curated literature datasets.

Version History

Jan 08, 2025 Version 4

Nov 11, 2024 Version 3

Oct 28, 2024 Version 2

Jun 18, 2024 Version 1

Version Notes

We added a new Table 2 and Fig. 8 in the main text, which benchmarks the formulation machine learning workflow on three experimental literature datasets: (1) prediction of viscosity of pure and binary solvents as a function of temperature, (2) prediction of drug solubility in single or binary solvents as a function of temperature, and (3) prediction of motor octane number of pure solutions and mixtures of hydrocarbons. The new results demonstrate that the machine learning tools developed in this work can accurately predict experimental properties of complex mixtures.

Metrics

Views: 3,993

Downloads: 1,755

License

The content is available under CC BY NC 4.0

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
