Abstract
SMILES-based deep learning models are slowly emerging
as an important research topic in cheminformatics. In this study, we introduce
SMILES Pair Encoding (SPE), a data-driven tokenization algorithm. SPE first
learns a vocabulary of high frequency SMILES substrings from a large chemical
dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned
vocabulary for deep learning models. As a result, SPE augments the widely used
atom-level tokenization by adding human-readable and chemically explainable
SMILES substrings as tokens. Case studies show that SPE can achieve superior
performance on both molecular generation and property prediction tasks. In the
molecular generation task, SPE can boost the validity and novelty of generated
SMILES. The molecular property prediction models were evaluated on 24 benchmark
datasets, where SPE consistently matched or outperformed atom-level
tokenization. Therefore, SPE could be a promising tokenization method
for SMILES-based deep learning models. An open-source Python package, SmilesPE,
was developed to implement this algorithm and is available at https://github.com/XinhaoLi74/SmilesPE.
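
The two stages described above (learning a vocabulary of frequent substrings, then tokenizing with it) can be illustrated with a minimal sketch in the style of byte pair encoding. This is not the SmilesPE API. The function names, the simplified atom-level regular expression, and the `min_frequency` threshold are all assumptions made for illustration, and the real package may differ in how it counts and merges pairs.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption for this sketch): bracket atoms,
# the two-letter halogens Br and Cl, two-digit ring closures, then any single
# character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Learn an ordered list of token pairs to merge, most frequent first."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, frequency = pair_counts.most_common(1)[0]
        if frequency < min_frequency:
            break  # remaining pairs are too rare to be worth a token
        merges.append(best_pair)
        sequences = [merge_pair(seq, best_pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize a SMILES string by applying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy corpus standing in for a large dataset such as ChEMBL.
toy_corpus = ["CC(=O)O", "CC(=O)N", "CC(=O)OC", "c1ccccc1C(=O)O"]
toy_merges = learn_merges(toy_corpus, num_merges=5)
print(spe_tokenize("CC(=O)OCC", toy_merges))
```

Because merges only concatenate adjacent tokens, joining the output tokens always reproduces the input SMILES, and the result is never longer than the atom-level tokenization.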