Abstract
How many near-neighbors does a molecule have? This is a simple and fundamental, yet unsolved, question in chemistry. Answering it is key to solving many important molecular optimization problems, for example lead optimization in drug discovery. Generative models can sample virtual molecules from a vast theoretical chemical space, but so far they have lacked explicit knowledge of molecular similarity. As a result, a generative model must be guided by reinforcement learning or another learning mechanism to sample a relevant region of chemical space. Correspondingly, the generative model provides no mechanism for quantifying how completely it can sample a particular region of chemical space. To overcome these limitations, a novel source-target molecular transformer model is proposed. The transformer model has a similarity-kernel-based regularization function. To the best of our knowledge, it has been trained on the largest data set of molecular pairs so far, consisting of ≥ 200 billion pairs. The regularization term enforces a direct relationship between the log-likelihood of generating a target molecule and its similarity to a given source molecule. The model is therefore able to systematically sample compounds ordered by their log-likelihood and hence by their similarity. In combination with a deterministic sampling strategy, beam search, this makes it possible for the first time to comprehensively explore the near-neighborhood around a specific compound. Our results show that the regularization term substantially improves the correlation between the log-likelihood of generating a target compound and its similarity to the source compound. The resulting model is able to exhaustively sample a near-neighborhood around a drug-like molecule.
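
To illustrate what tying log-likelihood to similarity can mean in practice, the following is a minimal sketch of one possible ranking-style penalty. It is an assumption made for illustration only and is not necessarily the regularization function used in this work: the function name `ranking_penalty`, the pairwise hinge form, and the example values are all hypothetical. For each pair of target molecules generated from the same source, the penalty is non-zero whenever the more similar target has the higher negative log-likelihood.

```python
from itertools import combinations


def ranking_penalty(nlls, sims):
    """Hypothetical pairwise ranking penalty (illustrative assumption).

    nlls: negative log-likelihoods of generating each target from one source.
    sims: similarity of each target to that source (higher = more similar).

    Returns the mean hinge violation over all target pairs with distinct
    similarity: a pair contributes when the more similar target has the
    larger negative log-likelihood, i.e. is judged less likely.
    """
    if len(nlls) != len(sims):
        raise ValueError("nlls and sims must have the same length")

    total = 0.0
    count = 0
    for i, j in combinations(range(len(nlls)), 2):
        if sims[i] == sims[j]:
            continue  # tied similarities impose no ordering
        # 'hi' indexes the target that is more similar to the source
        hi, lo = (i, j) if sims[i] > sims[j] else (j, i)
        total += max(0.0, nlls[hi] - nlls[lo])
        count += 1
    return total / count if count else 0.0


# Likelihood ordering agrees with similarity ordering: no penalty.
consistent = ranking_penalty([1.0, 2.0, 3.0], [0.9, 0.7, 0.5])

# Likelihood ordering is fully reversed: every pair is penalized.
reversed_order = ranking_penalty([3.0, 2.0, 1.0], [0.9, 0.7, 0.5])
```

When such a penalty is zero for every source molecule, ranking targets by log-likelihood and ranking them by similarity give the same order, which is the property that lets a deterministic decoder such as beam search enumerate near-neighbors from most to least similar.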