Abstract
We present SMILES embeddings derived from the internal encoder state of a Transformer model trained to canonicalize SMILES as a Seq2Seq problem. Using a CharNN architecture on top of
these embeddings yields higher-quality QSAR/QSPR models on diverse benchmark datasets
including regression and classification tasks. The proposed Transformer-CNN method uses
SMILES augmentation for both training and inference, so predictions are grounded in an internal
consensus. Both the augmentation and the embedding-based transfer learning allow the
method to provide good results for small datasets. We discuss the reasons for this effectiveness and outline future directions for the development of the method. The source code
and the embeddings are available at https://github.com/bigchem/transformer-cnn, while the
OCHEM environment (https://ochem.eu) hosts its online implementation.