Abstract
Finding new medicines is one of the most important tasks of pharmaceutical companies. One of the best approaches to finding a new drug starts with a simple question: given a known effective drug X, what are the top 100 molecules in our database most similar to X? The essence of the problem is thus a nearest-neighbors search, and the key question is how to define the distance between two molecules in the database. In this paper, we investigate the use of topological, rather than geometric or chemical, signatures for molecules, and two notions of distance that come from comparing these topological signatures. We introduce PH_VS (Persistent Homology for Virtual Screening), a new system for ligand-based screening using a topological technique known as multi-parameter persistent homology. We show that our approach can match or exceed a reasonable estimate of the current state of the art (including well-funded commercial tools), even with relatively little domain-specific tuning. Indeed, most of the components we have built for this system are general-purpose tools for data science and will be released soon as open-source software.
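
To make the nearest-neighbors framing concrete, the following is a minimal sketch of a top-k similarity query with a pluggable distance function. It is an illustration only, not the PH_VS implementation: the function name `top_k_similar`, the toy database, and the Euclidean placeholder distance are all assumptions introduced here. In the setting described above, the placeholder would be replaced by a distance between topological signatures.

```python
import heapq
import math

def top_k_similar(query, database, distance, k=100):
    """Return the k (distance, molecule_id) pairs nearest to `query`.

    `database` is an iterable of (molecule_id, signature) pairs and
    `distance` is any function taking two signatures. Both are
    hypothetical interfaces chosen for this sketch.
    """
    scored = ((distance(query, sig), mol_id) for mol_id, sig in database)
    # nsmallest keeps only k candidates in memory; results are sorted
    # by increasing distance.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])

# Placeholder distance (assumption): Euclidean distance between
# fixed-length numeric signatures. This is NOT a topological distance.
def euclidean(a, b):
    return math.dist(a, b)

# Toy database with made-up signatures, for illustration only.
toy_database = {
    "mol_a": (0.0, 0.0),
    "mol_b": (1.0, 0.0),
    "mol_c": (3.0, 4.0),
    "mol_d": (0.5, 0.5),
}

nearest = top_k_similar((0.0, 0.0), toy_database.items(), euclidean, k=2)
nearest_ids = [mol_id for _, mol_id in nearest]
```

The search procedure is independent of the choice of distance, which is why the key question is how that distance is defined.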