Abstract
Coarse-grained (CG) models simplify molecular representations by grouping multiple atoms into effective particles, enabling faster simulations and reducing the size of chemical compound space compared to atomistic methods. Additionally, models with chemical specificity, such as Martini, may extrapolate to cases where experimental data are scarce, making CG methods highly promising for high-throughput (HT) screening and chemical space exploration. Yet no rigorous data format exists for the crucial task of describing how the atoms are grouped (i.e., the mapping). As CG models advance toward true HT capabilities, the lack of mappings and indexing capabilities for the growing number of CG molecules poses a significant barrier. To address this, we introduce CGsmiles, a versatile line notation inspired by the popular Simplified Molecular Input Line Entry System (SMILES) and BigSMILES. CGsmiles encodes the molecular graph and particle (atom) properties independently of their resolution and incorporates a framework that allows seamless conversion between coarse- and fine-grained resolutions. By specifying fragments that describe how each particle is represented at the next finer resolution (e.g., CG particles to atoms), CGsmiles can represent multiple resolutions and their hierarchical relationships in a single string. In this paper, we present the CGsmiles syntax and analyze a benchmark set of 407 molecules from the Martini force field. We highlight key features that are missing from existing notations but essential for accurately describing CG models. To demonstrate the utility of CGsmiles beyond simulations, we construct two simple machine-learning models for predicting partition coefficients, both trained on CGsmiles-indexed data and leveraging information from both the CG and atomistic resolutions. Finally, we briefly discuss the applicability of CGsmiles to polymers, which particularly benefit from the multiresolution nature of the notation.
Supplementary materials
Title: Article Supporting Information
Description: Common mapping file formats; assignment of cis/trans isomers and mapping of sterols in Martini 3