Abstract
We describe a method for learning higher-level vector representations of the interactions between molecular features and biology. We call these representations reason vectors. In contrast to high-dimensional chemical fingerprints, reason vectors are much simpler, with only about five dimensions. They allow abstract reasoning about the bioactivity of chemicals or the absence thereof, uncover causal factors in interactions between chemical features, and generalize beyond specific chemical classes or bioactivities. These qualities enable powerful similarity searches that are vague and conceptual in nature. The methodology can handle novel combinations of features in query molecules and can evaluate chemical classes that are entirely absent from the training data. The method consists of a similarity-based near-neighbor search on a reference database of biologically tested chemicals, using as queries a series of substructures obtained from stepwise reconstruction of the test molecule. A data-driven continuous representation of molecular fragments is used for the molecular similarity computations. The technique was inspired by the ability of humans to learn and generalize complex concepts by interacting with the physical world. We also show that predicting the activity of chemicals from the abstract reason vectors is straightforward compared with modeling in the raw chemistry space, and that it can be applied to both binary and continuous activity outcomes. Apart from an unsupervised training step used to construct the continuous molecular fingerprints, the methodology involves no gradient optimization or statistical fitting.
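
As a rough illustration of the search step summarized above, the following minimal sketch runs a near-neighbor query against a small reference set once for each substructure in a stepwise reconstruction. It is not the implementation described in this work: the cosine similarity measure, the three-dimensional vectors, the reference entries and the names `cosine`, `nearest_neighbors` and `search_by_steps` are all assumptions made for illustration, and the construction of reason vectors from the search results is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, reference, k=2):
    """Return the k reference entries most similar to the query, as (similarity, name, activity)."""
    scored = [(cosine(query, vec), name, active) for name, vec, active in reference]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]


def search_by_steps(steps, reference, k=2):
    """Run one near-neighbor search per substructure vector, in reconstruction order."""
    return [nearest_neighbors(vec, reference, k) for vec in steps]


# Hypothetical reference database: (name, continuous vector, tested activity).
reference = [
    ("ref_A", [1.0, 0.0, 0.0], 1),
    ("ref_B", [0.9, 0.1, 0.0], 1),
    ("ref_C", [0.0, 1.0, 0.0], 0),
    ("ref_D", [0.0, 0.9, 0.4], 0),
]

# Hypothetical vectors for the substructures of one test molecule,
# ordered from the first fragment to the fully reconstructed molecule.
substructure_steps = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.1, 1.0, 0.3],
]

results = search_by_steps(substructure_steps, reference)
for step, hits in enumerate(results, start=1):
    print(step, [(name, active, round(sim, 3)) for sim, name, active in hits])
```

In this toy setting the neighbors shift from active to inactive reference entries as the substructure grows, which is the kind of step-by-step signal that a low-dimensional summary could capture.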