Abstract
Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) because of its significant roles in a wide range of biological processes. Although many computational tools for acetylation site identification have been developed, benchmark datasets and bespoke predictors for non-histone acetylation site prediction are lacking. To address these problems, this study contributes both a dataset and a predictor benchmark. First, we construct a non-histone acetylation site benchmark dataset, NHAC, which comprises 11 subsets defined by sequence length, ranging from 11 to 61 amino acids. Each sequence length has 886 positive samples and 4707 negative samples. Second, we propose a transformer-based neural network model, TransPTM, for non-histone acetylation site prediction. Our model uses a pre-trained protein language model, ProtT5, to construct the site’s feature space. The graph neural network (GNN) framework consists of three TransformerConv layers for feature extraction and a multilayer perceptron (MLP) module for classification. In experiments, TransPTM achieves competitive performance for non-histone acetylation site prediction compared with three state-of-the-art (SOTA) tools. It improves our understanding of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The source code and data used by TransPTM are available at https://www.github.com/TransPTM.