Abstract
How many near-neighbors does a molecule have? This is a simple, fundamental, but unsolved question in chemistry. It is key to solving many important molecular optimization problems, for example lead optimization in drug discovery under the similarity principle assumption. Generative models can sample virtual molecules from a vast theoretical chemical space, but so far they have lacked explicit knowledge of molecular similarity. A generative model therefore needs to be guided by reinforcement learning or another learning mechanism before it can sample the relevant, similar region of chemical space. Correspondingly, the generative model provides no mechanism for quantifying how completely it can sample a particular region of chemical space. To overcome these limitations, a novel source-target molecular transformer model is proposed, regularized via a similarity kernel function. To the best of our knowledge, it has been trained on the largest data set of molecular pairs so far, consisting of ≥ billion pairs. The regularization term enforces a direct relationship between the probability of generating a target molecule and its similarity to a given source molecule. The model is therefore able to systematically sample compounds ordered by their probability and, accordingly, by their similarity. In combination with a deterministic sampling strategy, beam search, it is possible for the first time to comprehensively explore the near-neighborhood around a specific compound. Our results show that the regularization term substantially improves the correlation between the probability of generating a target molecule and its similarity to the source molecule. The trained transformer model is able to exhaustively sample the near-neighborhood around a given drug-like molecule.
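To make the idea of tying generation probability to similarity concrete, the following is a minimal sketch of one possible pairwise ranking penalty. It is an assumption for illustration only and is not taken from the model described here: the function name `ranking_penalty`, the hinge form, and the toy similarity and probability values are all hypothetical. The penalty is zero when more similar targets are at least as probable as less similar ones, and grows when that ordering is violated.

```python
import math
from itertools import combinations


def ranking_penalty(log_probs, similarities):
    """Hypothetical pairwise ranking penalty (an assumption, not the paper's loss).

    For every pair of candidate targets, add a penalty when the less similar
    target receives the higher log-probability from the model.
    """
    penalty = 0.0
    for i, j in combinations(range(len(log_probs)), 2):
        if similarities[i] == similarities[j]:
            continue  # equal similarity implies no preferred ordering
        # Orient the pair so that `hi` indexes the more similar target.
        hi, lo = (i, j) if similarities[i] > similarities[j] else (j, i)
        # Hinge term: zero when the more similar target is at least as probable.
        penalty += max(0.0, log_probs[lo] - log_probs[hi])
    return penalty


# Toy values, invented for illustration only.
sims = [0.9, 0.7, 0.4]  # similarity of each target to the source
consistent = [math.log(0.5), math.log(0.3), math.log(0.2)]
inconsistent = [math.log(0.2), math.log(0.3), math.log(0.5)]

print(ranking_penalty(consistent, sims))    # ordering agrees with similarity
print(ranking_penalty(inconsistent, sims))  # ordering is reversed
```

In a training setting, a term of this kind would be added to the usual negative log-likelihood with some weighting; the weighting and the exact functional form are design choices that this sketch does not attempt to reproduce.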