Abstract
Glycosylation represents a major chemical challenge; while it is one of the most common reactions in Nature, conventional chemistry struggles with stereochemistry, regioselectivity and solubility issues. In contrast, family 1 glycosyltransferase (GT1) enzymes can glycosylate virtually any given nucleophilic group with perfect control over stereochemistry and regioselectivity. However, the appropriate catalyst for a given reaction needs to be identified among the tens of thousands of available sequences. Here, we present the Glycosyltransferase Acceptor Specificity Predictor (GASP) model, a data-driven approach to the identification of reactive GT1:acceptor pairs. We trained a random forest-based acceptor predictor on literature data and validated it on independent, in-house generated data on 1001 GT1:acceptor pairs, obtaining an AUROC of 0.79 and a balanced accuracy of 72%. GASP is capable of parsing all known GT1 sequences, as well as all chemicals, the latter through a pipeline that generates 153 chemical features for a given molecule from its CID or SMILES (freely available at https://github.com/degnbol/GASP). In a comparative case study on the glycosylation of the anthelmintic drug niclosamide, GASP had an 83% hit rate, significantly outperforming the 53% hit rate of a random selection assay. For the glycosylation of the plant defensive compound DIBOA, however, GASP achieved a hit rate of 50% and could not compete with the 83% hit rate obtained using expert-selected enzymes. The hierarchical importance of the generated chemical features was investigated by negative feature selection, which revealed properties related to cyclization and atom hybridization status to be the most important characteristics for accurate prediction. Our study provides a ready-to-use GT1:acceptor predictor that can, in addition, be trained on other datasets by means of the automated feature generation pipelines.
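For readers unfamiliar with the two validation metrics quoted above, the following minimal sketch shows how balanced accuracy and AUROC are commonly computed for a binary reactive/non-reactive classifier. It is not taken from the GASP repository: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made purely for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = reactive GT1:acceptor pair, 0 = non-reactive pair.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical predicted probabilities of reactivity, not real model output.
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.3f}")
print(f"AUROC = {auroc(y_true, scores):.3f}")
```

Balanced accuracy averages the per-class recalls, so it is not inflated when non-reactive pairs greatly outnumber reactive ones, whereas AUROC summarises ranking quality independently of any single threshold.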
Supplementary materials
Title
Appendix 1
Description
Name, CID, and structure of acceptors used in NADH coupled enzyme assay to generate the independent in-house dataset.
Title
SI for "GASP: A pan-specific predictor of family 1 glycosyltransferase specificity enabled by a pipeline for substrate feature generation and large-scale experimental screening"
Description
Supporting figures and tables for experimental and computational studies.
Title
dataset1
Description
In-house dataset used to test performance of GASP. Contains enzyme information, substrate information, rate, and reaction classification.
Supplementary weblinks
Title
GitHub repository for GASP code
Description
GitHub repository for all code related to the installation, training, optimisation, application, and validation of the GASP model.