Abstract
Machine learning algorithms have shown high accuracy in predicting chemical reaction outcomes and retrosynthesis. However, designing multi-step synthesis pathways remains challenging for existing machine learning models, which are trained for single-step prediction. In this manuscript, we propose a new approach that recasts retrosynthesis as a string optimization problem, leveraging the similarity between chemical reactions and multidimensional geometric vectors. On this premise, a complex multi-step synthesis can be conceptualized as a sequence linking multidimensional vectors (fingerprints), each representing an individual chemical reaction step. We extracted an extensive corpus of chemical syntheses from patents and converted them into multidimensional strings. While optimizing the retrosynthetic path, we use the Euclidean metric to minimize the distance between the expanded trajectory of the growing retrosynthesis string and the corpus of extracted strings. This promotes the assembly of synthetic pathways that, in chemical reaction space, are more similar to existing retrosyntheses and thereby inherit the strategic guidelines designed by human experts. We integrated this approach into the RXN platform (https://rxn.res.ibm.com/) and present the method’s application to complex syntheses as well as its ability to produce better synthetic strategies than current methodologies.
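The distance-guided selection described above can be illustrated with a minimal sketch. It is not the implementation used in the RXN platform: the function names, the step-wise alignment of a growing route against each corpus string, the use of a mean over aligned steps, and the two-dimensional toy fingerprints are all assumptions made for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two fingerprint vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def route_distance(route, reference):
    """Mean step-wise Euclidean distance between a growing route and the
    leading steps of a reference string (assumed alignment scheme).
    A reference shorter than the route cannot be aligned, so it yields inf."""
    if not route:
        return 0.0
    if len(reference) < len(route):
        return math.inf
    return sum(euclidean(s, r) for s, r in zip(route, reference)) / len(route)


def distance_to_corpus(route, corpus):
    """Distance from a route to its closest string in the corpus."""
    return min(route_distance(route, ref) for ref in corpus)


def pick_expansion(partial_route, candidate_steps, corpus):
    """Choose the candidate step whose addition keeps the route closest
    to the corpus (hypothetical greedy selection)."""
    return min(
        candidate_steps,
        key=lambda step: distance_to_corpus(partial_route + [step], corpus),
    )


# Toy corpus: two "strings", each a list of 2-D reaction fingerprints.
corpus = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0]],
]
partial = [[1.0, 1.0]]                  # route built so far
candidates = [[0.0, 3.0], [1.0, 2.0]]   # fingerprints of possible next steps

best = pick_expansion(partial, candidates, corpus)
print(best)
```

In this toy setting the second candidate is selected because extending the route with it reproduces the first two steps of the second corpus string exactly. A real system would operate on high-dimensional learned reaction fingerprints and would likely combine this distance with other scoring terms.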