Abstract
Reliably quantifying how similar two objects are is a core problem in any
field. Once a similarity measure has been established, it can be used to
quantify relationships between objects precisely, both analytically and with
machine learning (ML). For inorganic solids,
the chemical composition is a fundamental descriptor, which can be represented
as a vector holding the ratio of each element in the material. These
vectors are a convenient mathematical data structure for measuring similarity,
but unfortunately, the standard metric (the Euclidean distance) gives little to
no variance in the resultant distances between chemically dissimilar
compositions. We present the Earth Mover’s Distance (EMD) for inorganic
compositions, a well-defined metric which enables chemical similarity to be
measured in an explainable fashion. We compute the EMD between two
compositions from the ratio of each of the elements and the absolute distance between
the elements on the modified Pettifor scale. This simple metric shows clear
strength at distinguishing compounds and is efficient to compute in practice.
The resultant distances have greater alignment with chemical understanding than
the Euclidean distance, which is demonstrated on the binary compositions of the
Inorganic Crystal Structure Database (ICSD). The EMD is a reliable numeric
measure of chemical similarity that can be incorporated into automated
workflows for a range of ML techniques. We have found that, with no supervision,
this metric partitions binary compounds distinctly into
clear trends and families of chemical properties, with future applications in
nearest-neighbor search queries for chemical database retrieval systems and in
supervised ML techniques.
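
As a rough illustration of the computation described above, the following minimal sketch computes a one-dimensional EMD between two compositions from their element ratios and the absolute distance between element positions on a scale. The positions in `SCALE` are hypothetical placeholders chosen for illustration, not the actual modified Pettifor numbers, and the function names `normalise`, `emd`, and `euclidean` are assumptions of this sketch rather than the authors' implementation. It relies on the standard result that, in one dimension, the EMD equals the accumulated absolute difference between the two cumulative distributions.

```python
from math import sqrt

# Hypothetical positions on a 1D element scale. These are illustrative
# placeholders only, NOT the real modified Pettifor numbers.
SCALE = {"K": 1, "Na": 2, "Mg": 5, "O": 20, "Br": 22, "Cl": 23}


def normalise(comp):
    """Convert element amounts (e.g. {"Na": 1, "Cl": 1}) to ratios summing to 1."""
    total = sum(comp.values())
    return {el: amt / total for el, amt in comp.items()}


def emd(comp_a, comp_b, scale=SCALE):
    """1D Earth Mover's Distance between two compositions.

    Walks the elements in scale order and accumulates the surplus that
    must be carried across each gap, weighted by the width of that gap.
    """
    a, b = normalise(comp_a), normalise(comp_b)
    elements = sorted(set(a) | set(b), key=lambda el: scale[el])
    distance = 0.0
    carried = 0.0  # running difference of the two cumulative distributions
    for left, right in zip(elements, elements[1:]):
        carried += a.get(left, 0.0) - b.get(left, 0.0)
        distance += abs(carried) * (scale[right] - scale[left])
    return distance


def euclidean(comp_a, comp_b):
    """Euclidean distance between the two ratio vectors, for comparison."""
    a, b = normalise(comp_a), normalise(comp_b)
    return sqrt(sum((a.get(el, 0.0) - b.get(el, 0.0)) ** 2 for el in set(a) | set(b)))


nacl = {"Na": 1, "Cl": 1}
kbr = {"K": 1, "Br": 1}
mgo = {"Mg": 1, "O": 1}

print(euclidean(nacl, kbr), euclidean(nacl, mgo))
print(emd(nacl, kbr), emd(nacl, mgo))
```

With these placeholder positions, the Euclidean distance is the same for both pairs because neither pair shares an element, whereas the EMD places NaCl closer to KBr than to MgO. The specific numbers depend entirely on the assumed scale and carry no chemical meaning.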
Supplementary materials
Title: emd si