GASP: A pan-specific predictor of family 1 glycosyltransferase specificity enabled by a pipeline for substrate feature generation and large-scale experimental screening

03 November 2023, Version 1

Abstract

Glycosylation represents a major chemical challenge; while it is one of the most common reactions in nature, conventional chemistry struggles with stereochemistry, regioselectivity and solubility issues. In contrast, family 1 glycosyltransferase (GT1) enzymes can glycosylate virtually any given nucleophilic group with perfect control over stereochemistry and regioselectivity. However, the appropriate catalyst for a given reaction needs to be identified among the tens of thousands of available sequences. Here, we present the Glycosyltransferase Acceptor Specificity Predictor (GASP) model, a data-driven approach to the identification of reactive GT1:acceptor pairs. We trained a random forest-based acceptor predictor on literature data and validated it on independent in-house generated data on 1001 GT1:acceptor pairs, obtaining an AUROC of 0.79 and a balanced accuracy of 72%. GASP is capable of parsing all known GT1 sequences, as well as all chemicals, the latter through a pipeline that generates 153 chemical features for a given molecule, taking the CID or SMILES as input (freely available at https://github.com/degnbol/GASP). In a comparative case study on the glycosylation of the anthelmintic drug niclosamide, GASP had an 83% hit rate, significantly outperforming the 53% hit rate of a random selection assay. However, for the glycosylation of the plant defensive compound DIBOA, GASP achieved a hit rate of 50% and could not compete with the 83% hit rate obtained with expert-selected enzymes. The hierarchical importance of the generated chemical features was investigated by negative feature selection, revealing properties related to cyclization and atom hybridization status to be the most important characteristics for accurate prediction. Our study provides a ready-to-use GT1:acceptor predictor that can additionally be trained on other datasets, enabled by the automated feature generation pipelines.
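For readers unfamiliar with the two validation metrics quoted above, the following is a minimal sketch of how balanced accuracy and AUROC are conventionally defined for a binary reactive/non-reactive classifier. It is not taken from the GASP code base: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical, chosen only to illustrate the definitions.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = reactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation of AUROC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 reactive and 5 non-reactive enzyme:acceptor pairs.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5 (illustrative only).
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

toy_bacc = balanced_accuracy(toy_labels, toy_preds)
toy_auc = auroc(toy_labels, toy_scores)
print(f"balanced accuracy = {toy_bacc:.3f}, AUROC = {toy_auc:.3f}")
```

Balanced accuracy averages the per-class recall, so it is not inflated when non-reactive pairs outnumber reactive ones. AUROC is threshold-free and summarizes how well the predicted scores rank reactive pairs above non-reactive ones.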

Keywords

Glycosyltransferases
Machine Learning
High-throughput Assay
Enzyme Specificity

Supplementary materials

Appendix 1: Name, CID, and structure of acceptors used in the NADH-coupled enzyme assay to generate the independent in-house dataset.
SI for "GASP: A pan-specific predictor of family 1 glycosyltransferase specificity enabled by a pipeline for substrate feature generation and large-scale experimental screening": Supporting figures and tables for experimental and computational studies.
dataset1: In-house dataset used to test performance of GASP. Contains enzyme information, substrate information, rate, and reaction classification.
