DP-OTG: A Feature-Free Deep Learning Model for Accurate Prediction of Human O-Linked Threonine Glycosylation Sites

Abstract

Protein O-linked threonine glycosylation (OTG) is a crucial post-translational modification in eukaryotic species, playing a vital role in diverse biological processes. In humans, dysregulation of OTG has been associated with serious diseases, including cancer and neurological disorders. However, experimental detection of OTG sites remains costly and labor-intensive, underscoring the need for effective computational approaches. In this study, we introduce DP-OTG, a feature-free deep learning model for the accurate prediction of human OTG sites. Unlike existing tools that rely heavily on handcrafted features or large-scale pretrained language models, DP-OTG employs a hybrid architecture combining multi-kernel convolutional neural networks, bidirectional long short-term memory, and a trainable embedding layer to automatically learn sequence patterns directly from raw protein sequences. This end-to-end framework captures both local and long-range sequence dependencies without the need for manual feature engineering. Extensive evaluations using 10-fold cross-validation and independent testing demonstrate that DP-OTG achieves superior predictive performance, with an accuracy of 88.8% and a Matthews correlation coefficient (MCC) of 0.776 on the balanced test set, and an accuracy of 89.3% and an MCC of 0.661 on the imbalanced test set, outperforming several state-of-the-art predictors. In addition, to assess the discriminative power and generalization ability of DP-OTG in predicting human OTG sites, we employed t-distributed stochastic neighbor embedding to visualize the feature representations before and after training. These results underscore the effectiveness of DP-OTG in extracting robust features for accurate OTG site prediction, even under challenging data distributions. Our findings highlight DP-OTG as a robust, efficient, and scalable tool for human OTG site prediction.
All the code and resources related to this study have been made freely accessible at: https://github.com/nuinvtnu/DP-OTG/.

Keywords

bidirectional long short-term memory (Bi-LSTM); 1D convolutional neural networks (1D-CNN); human O-linked threonine glycosylation (OTG); natural language processing; post-translational modifications (PTMs); prediction

1. INTRODUCTION

Glycosylation is a complex post-translational modification (PTM) that plays a vital role in protein folding, stability, and cellular communication (Hart, 1992; Haltiwanger and Lowe, 2004; Arey, 2012; Jayaprakash and Surolia, 2017). It exists in two primary forms: N-linked and O-linked glycosylation. Unlike N-linked glycosylation, O-linked glycosylation lacks a conserved sequence motif, making its site-specific prediction particularly challenging (Lis and Sharon, 1993). Notably, glycosylation has been observed in human cancer cells for decades and has also been implicated in Alzheimer’s disease, highlighting its potential significance in disease mechanisms (Schedin‐Weiss et al., 2014; Oliveira-Ferrer et al., 2017; Reily et al., 2019; Magalhães et al., 2021). However, experimental methods for identifying O-glycosylation sites are labor-intensive and costly, underscoring the need for efficient computational models.

Several machine learning-based methods have been developed to predict O-linked glycosylation sites, such as GlycoMine (Li et al., 2015), GlycoEP (Chauhan et al., 2013), OGP (Huang et al., 2021), and NetOGlyc-4.0 (Steentoft et al., 2013). These tools leverage various feature extraction techniques and classification algorithms to identify glycosylation sites with different levels of accuracy and efficiency. For instance, GlycoMine integrates multiple sequence- and structure-based features, while GlycoEP employs an ensemble approach to enhance predictive performance. OGP utilizes physicochemical properties and evolutionary information, whereas NetOGlyc-4.0 applies artificial neural networks to capture complex sequence patterns. Despite their effectiveness, many of these methods rely on handcrafted feature engineering, which can be time-consuming and computationally expensive. Moreover, their generalizability across different datasets remains a challenge, highlighting the need for more robust and efficient predictive models.

Notably, four recent tools, O-GlyThr (Tang et al., 2023), HOTGpred (Pham et al., 2024), Stack-OglyPred-PLM (Pakhrin et al., 2024), and DOGpred (Lee et al., 2025), have been proposed for O-linked glycosylation site prediction. O-GlyThr utilized seven handcrafted features combined with traditional classifiers like Random Forest. More recent approaches have integrated pretrained protein language models (PLMs) to extract informative sequence embeddings. HOTGpred, published in 2024, employs a handcrafted feature selection strategy, integrating 25 features, including 14 PLM-based embeddings (e.g., CPCProt, ProtTransXLNetUniRef100, Word2Vec), and 11 conventional descriptors (e.g., AAIndex1, composition-transition-distribution descriptors). Additionally, it applies machine learning models to refine feature selection, resulting in a high-dimensional representation (4864 features), and utilizes the XGBoost algorithm for classification. Stack-OglyPred-PLM, published in 2024, utilizes the ProtT5-XL-UniRef50 protein language model to generate contextualized embeddings for sites of interest (“S/T”), which are then fed into a Meta-Ensemble Model. A stacked generalization (meta-ensemble) approach is applied, where two Multi-Layer Perceptron models trained on ProtT5 and Ankh PLM embeddings are combined using a meta-model to optimize prediction accuracy. Meanwhile, DOGpred, published in 2025, follows a similar approach but with a more compact feature set of 18 features, comprising 9 conventional descriptors (e.g., AAIndex, Binary) and 9 embeddings from pretrained models (e.g., ESB, ESM, PTAB). DOGpred integrates a hybrid deep learning model combining convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM), offering an alternative predictive framework.

However, the above methods rely on manual feature engineering, which introduces subjectivity, demands extensive human effort, and separates feature extraction from model construction. Additionally, the tools that use large-scale PLMs require substantial computational resources for feature extraction, leading to long processing times. This dependence on manual feature selection may limit their ability to capture intricate sequence patterns comprehensively. Building on our previous research on PTMs and the development of hybrid deep learning models (Nguyen et al., 2017; Kao et al., 2020; Tran et al., 2023; Nguyen et al., 2024; Tran et al., 2025a), we aim to improve the prediction of protein O-linked threonine glycosylation (OTG) sites. In this study, we introduce DP-OTG, a feature-free deep learning model designed for accurate prediction of human OTG sites. This end-to-end architecture integrates deep learning with natural language processing (NLP) techniques, eliminating the need for handcrafted features. Our approach combines multi-kernel 1D-CNN, Bi-LSTM networks, and NLP-based sequence encoding to automatically capture both local patterns and long-range dependencies within protein sequences. This integrated design enables the model to identify diverse glycosylation-related motifs across multiple scales while preserving the order of residues in the sequences.

Through extensive experiments, we demonstrate that our proposed model outperforms state-of-the-art methods on both balanced and imbalanced datasets, highlighting its effectiveness and robustness in O-glycosylation site prediction.

2. METHODS

2.1. Data collection and preprocessing

In this study, the datasets of human OTG were collected from HOTGpred (Pham et al., 2024), DOGpred (Lee et al., 2025), O-GlyThr (Tang et al., 2023), the UniProt database (UniProt Consortium, 2024), and the related literature. After removing redundant data, we decided to use the same dataset as the most recent studies on human OTG site prediction, including HOTGpred (Pham et al., 2024), DOGpred (Lee et al., 2025), and O-GlyThr (Tang et al., 2023). This resulted in 318 human OTG proteins. The decision to adopt this dataset in our study was driven by several key factors. First, using the same dataset as previous state-of-the-art models ensures a fair and direct comparison of model performance. Second, its widespread adoption establishes it as a standard reference for evaluating new models. Lastly, the inclusion of an imbalanced test set allows for a more realistic assessment of the model’s ability to handle scenarios in which human non-OTG sites significantly outnumber human OTG sites, an essential consideration for practical applications. By leveraging this dataset, we ensure consistency in evaluation and provide meaningful comparisons with the latest advancements in OTG prediction.

To remove duplicate and redundant proteins, the CD-HIT tool (Km et al., 2020) was applied with a 40% sequence identity threshold, which refined the dataset to 246 unique human proteins. Since this study focuses on the sequence-based characterization of threonine (T) sites and their substrate specificities, extracting sequence fragments from the full FASTA sequence of proteins is necessary. It is worth noting that this study specifically focuses on O-linked glycosylation occurring at threonine (Thr) residues. This decision was made to ensure consistency with previous benchmark datasets (HOTGpred, DOGpred, and O-GlyThr), which were also designed exclusively for threonine-based glycosylation prediction. Furthermore, the number of experimentally verified Thr sites in humans is considerably higher and more reliable than that of Ser sites, providing a stronger foundation for model training and evaluation. Future studies will aim to extend this framework to include serine and mixed O-linked glycosylation sites for a more comprehensive analysis. Consistent with previous research, positive samples were constructed by extracting sequence fragments using a window size of $2n + 1$, centered on a threonine (T) residue that has been experimentally verified as an OTG site, with $n$ upstream and $n$ downstream residues. For fragments with fewer than $n$ residues on either side, pseudo-amino acids (“X”) were added to standardize the length to $2n + 1$.

Based on preliminary experiments with different window sizes, the optimal window size was found to be 41 ($n = 20$); therefore, a window size of 41 was used to extract peptides from the full FASTA protein sequences in this study. For negative samples, threonine (T) residues not experimentally verified as OTG sites were treated as non-OTG sites, and the same window size of $2n + 1$ was applied to extract negative fragments. To prevent overfitting and enhance the model’s generalization, sequence fragments were filtered using the CD-HIT tool (Km et al., 2020) with a 40% identity cutoff. The final datasets used in this study are summarized in Table 1.
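The window extraction described above can be illustrated with a minimal Python sketch. The function name `extract_windows`, its arguments, and the 1-based position convention are assumptions made for illustration only; they are not taken from the released DP-OTG code.

```python
def extract_windows(sequence, positive_positions, n=20, pad="X"):
    """Cut a (2n + 1)-residue fragment around every threonine in `sequence`.

    `positive_positions` holds 1-based positions of experimentally verified
    OTG sites (hypothetical input format). Returns (positives, negatives),
    each a list of (position, fragment) pairs.
    """
    # Pad both ends so that sites near a terminus still yield full windows.
    padded = pad * n + sequence + pad * n
    positives, negatives = [], []
    for i, residue in enumerate(sequence):
        if residue != "T":
            continue
        # Residue i of the original sequence sits at padded index i + n,
        # so this slice places it exactly at the centre of the fragment.
        fragment = padded[i:i + 2 * n + 1]
        target = positives if (i + 1) in positive_positions else negatives
        target.append((i + 1, fragment))
    return positives, negatives


# Toy example with n = 3 (window of 7) instead of the n = 20 used in the study.
pos, neg = extract_windows("MTAPTS", positive_positions={2}, n=3)
# pos == [(2, "XXMTAPT")], neg == [(5, "TAPTSXX")]
```

With the default `n=20`, every fragment has length 41 and the candidate threonine sits at index 20 of the fragment.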

Table 1.
Training and Testing Datasets Used in This Study

Dataset	Protein	Positive sites	Negative sites
Initial dataset	318	1078	110,923
Filtered dataset (CD-HIT, 40% threshold; negatives randomly sampled to avoid potential bias during model development)	246	1078	1078
Training	—	878	878
Balanced test	—	200	200
Imbalanced test	—	200	1000

Through the process of extracting peptides and removing homologous data, the final dataset consisted of 1078 positive and 110,923 negative samples. To mitigate class imbalance, 1078 negative samples were randomly selected to balance the dataset, forming a training set of 878 positive and 878 negative samples, along with a balanced test set of 200 positive and 200 negative samples. Additionally, an imbalanced independent test set, introduced by Pham et al. (2024), extends the balanced test set by incorporating 800 additional negative samples randomly selected from the original negative pool, reflecting real-world data distribution.
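The sampling scheme above can be sketched as follows. This is only an illustration of the set sizes: the function name `build_splits`, the fixed seed, and the choice to draw the 800 extra negatives from samples not already used are assumptions, and the actual benchmark splits are those distributed with the cited studies.

```python
import random


def build_splits(positives, negatives, n_test_pos=200, n_extra_neg=800, seed=0):
    """Return (train, balanced_test, imbalanced_test) dictionaries."""
    rng = random.Random(seed)
    pos = list(positives)
    neg_pool = list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg_pool)

    # Undersample: keep as many negatives as there are positives.
    neg_balanced = neg_pool[:len(pos)]
    neg_unused = neg_pool[len(pos):]

    balanced_test = {"pos": pos[:n_test_pos], "neg": neg_balanced[:n_test_pos]}
    train = {"pos": pos[n_test_pos:], "neg": neg_balanced[n_test_pos:]}
    # The imbalanced test set reuses the balanced one and adds more negatives.
    imbalanced_test = {
        "pos": balanced_test["pos"],
        "neg": balanced_test["neg"] + neg_unused[:n_extra_neg],
    }
    return train, balanced_test, imbalanced_test


# Placeholder identifiers standing in for the real peptide fragments.
positives = [f"P{i}" for i in range(1078)]
negatives = [f"N{i}" for i in range(110923)]
train, balanced, imbalanced = build_splits(positives, negatives)
```

With the sizes reported in Table 1, this yields 878/878 training samples, a 200/200 balanced test set, and a 200/1000 imbalanced test set.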

2.2. Features extraction and encoding

Similar to human natural language, protein sequences can be naturally represented as character strings. The protein alphabet consists of 20 common amino acids, excluding rare and uncommon ones. Like natural language, naturally evolved proteins exhibit modular structures with slight variations, arranged hierarchically. In this analogy, common protein motifs and domains, which serve as fundamental functional units, resemble words, phrases, and sentences in human language. This structural similarity makes NLP-based encoding a powerful approach for protein sequence analysis (Ofer et al., 2021).

In this study, protein sequences are first tokenized using the 1-gram technique, where each amino acid is treated as an individual unit. Each amino acid is subsequently assigned an integer index based on a predefined amino acid dictionary before being passed to the embedding layer. This choice ensures that the encoding fully preserves sequence order and primary structure information without introducing artificial dependencies. The tokenized sequences are then processed through the Embedding layer of a deep learning model, which maps each amino acid to a continuous vector space, enabling the model to capture biological relevance directly from data.

Unlike manual feature extraction methods that rely on predefined biochemical properties, this approach allows the model to learn adaptive representations through training. The Embedding Layer functions as a dynamic lookup table, refining vector representations as training progresses. Through forward propagation, loss computation, and backpropagation, embeddings evolve to capture sequence patterns critical for PTM site prediction. This integration of deep learning with NLP-based techniques enables automatic extraction of key sequence features, enhancing predictive performance without the need for handcrafted rules. The process of feature encoding using NLP techniques (Fig. 1) is summarized in two main steps as follows:

FIG. 1.

Illustration of the n-gram–based tokenization and embedding process used in the DP-OTG model.

To illustrate the tokenization and embedding procedure, Figure 1 presents an overview of the n-gram-based encoding concept, in which protein sequences can theoretically be decomposed into overlapping residue fragments (e.g., 1-gram, 2-gram, or 3-gram). Each token is then converted into a numerical index based on a predefined amino acid vocabulary and subsequently mapped to a dense vector through the Embedding layer. Although various n-gram examples are depicted to clarify the general principle, in this study, we employed only the 1-gram representation, as it provided the most informative and stable features according to our previous comparative experiments (Nguyen et al., 2024; Tran et al., 2025a, 2025b).

Step 1: The sequence is tokenized using the n-gram method. For example, in 1-gram tokenization, each amino acid is converted into a single token. This process utilizes a dictionary containing the 20 standard amino acids, with a special “X” representing unidentified amino acids, to generate a numerical vector in the form of $(x_{1}, x_{2}, \dots, x_{L})$, where $L$ denotes the length of the protein sequence or peptide.

Step 2: The tokens obtained in Step 1 are mapped to embedding vectors $(e_{1}, e_{2}, \dots, e_{L})$ through an Embedding layer with a dimension of 300. These embedding vectors are not fixed but are optimized during model training using the backpropagation algorithm. This allows the model to learn the best representation for each amino acid in the context of the protein sequence.
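The two steps can be sketched in plain Python. The residue ordering, the reservation of index 0, the mapping of any non-standard letter to “X”, and the random initialization range are assumptions for illustration; in the actual model the lookup table is a trainable Embedding layer whose rows are updated by backpropagation, which this sketch omits.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Indices start at 1; index 0 is left unused (often reserved for padding).
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB["X"] = len(VOCAB) + 1  # pseudo/unidentified residue gets index 21


def encode_1gram(fragment):
    """Step 1: map each residue to its integer index (x_1, ..., x_L)."""
    return [VOCAB.get(residue, VOCAB["X"]) for residue in fragment]


def init_embeddings(vocab_size=22, dim=300, seed=0):
    """Randomly initialised lookup table with one 300-d row per index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(tokens, table):
    """Step 2: look up the embedding vector (e_1, ..., e_L) of each token."""
    return [table[t] for t in tokens]


tokens = encode_1gram("XXMTAPT")   # [21, 21, 11, 17, 1, 13, 17]
vectors = embed(tokens, init_embeddings())
```

Both occurrences of “T” in the fragment map to the same row of the table, which is what lets training refine one shared representation per amino acid.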

2.3. Model construction, learning and evaluation

To predict human OTG sites accurately and efficiently, we designed a hybrid deep learning architecture named DP-OTG. Our approach combines three parallel five-layer 1D-CNN branches, each with a different kernel size, and a Bi-LSTM network, leveraging NLP-based word embedding techniques to automatically learn meaningful sequence representations from raw protein sequences. As depicted in Figure 2, the proposed model consists of three main parts: data collection and pre-processing; data encoding using NLP techniques; and network architecture.

FIG. 2.

The system architecture of the proposed model.

The model consists of three parallel CNN branches, each using a different kernel size (3, 5, and 7) to extract features at different levels. Each CNN branch includes five Conv1D layers with 32 filters, ReLU activation, and “same” padding to preserve the output size. After the convolutional layers, each branch applies a MaxPooling1D layer with a pool size of 2 to reduce dimensionality, followed by a Dropout layer (rate of 0.4) to prevent overfitting, and a BatchNormalization layer to normalize activations. In addition to the CNN branches, the model includes a Bi-LSTM branch, which applies a Bi-LSTM layer with 32 hidden units to the embedded sequence, capturing contextual information in both forward and backward directions. This branch also incorporates a Dropout layer (rate of 0.4) and a BatchNormalization layer for improved training stability. The outputs from the three CNN branches and the Bi-LSTM branch are flattened using Flatten layers and then merged using a Concatenate layer. The fully connected network consists of two Dense layers with 128 and 64 units, respectively, both using ReLU activation and L2 regularization (coefficient 0.01) to mitigate overfitting. Each dense layer is followed by a Dropout layer with a dropout rate of 0.5. Finally, the output layer consists of 2 units with a softmax activation function, enabling the model to perform binary classification. The model uses a standard decision boundary of 0.5, where the class with a predicted probability greater than 0.5 is assigned as the positive class.
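To make the tensor sizes concrete, the following sketch traces the number of features each branch contributes to the Concatenate layer for a 41-residue input. It is an arithmetic illustration under stated assumptions, not the released implementation: pooling is taken to be non-overlapping without padding, the two LSTM directions are taken to be concatenated, and, because the text applies a Flatten layer to the Bi-LSTM branch, that branch is assumed to return one output per position (`per_position=True`). All function names are hypothetical.

```python
def cnn_branch_features(seq_len=41, filters=32, pool_size=2):
    """Flattened size of one CNN branch.

    "same" padding with stride 1 keeps the length through all five Conv1D
    layers, so only the pooling step changes the sequence length.
    """
    pooled_len = seq_len // pool_size  # non-overlapping pooling drops the remainder
    return pooled_len * filters


def bilstm_branch_features(seq_len=41, units=32, per_position=True):
    """Flattened size of the Bi-LSTM branch."""
    width = 2 * units  # forward and backward states concatenated
    return seq_len * width if per_position else width


def merged_features(seq_len=41, kernels=(3, 5, 7), per_position=True):
    """Size of the vector fed to the first Dense layer."""
    cnn_total = len(kernels) * cnn_branch_features(seq_len)
    return cnn_total + bilstm_branch_features(seq_len, per_position=per_position)


print(merged_features())                     # 3 * 640 + 2624 = 4544
print(merged_features(per_position=False))   # 3 * 640 + 64   = 1984
```

Under these assumptions each CNN branch contributes 20 × 32 = 640 features regardless of kernel size, so the kernels differ in the motifs they detect rather than in the amount of output they produce.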

To provide a fair and comprehensive comparison, a sequential baseline model was additionally constructed. In this baseline architecture, three parallel branches were designed, each consisting of a 1D-CNN module with a kernel size of 3, 5, or 7, followed directly by a Bi-LSTM layer (CNN3 → Bi-LSTM, CNN5 → Bi-LSTM, and CNN7 → Bi-LSTM). Each CNN module shares the same configuration as the CNN branches in DP-OTG, including five Conv1D layers, max-pooling, dropout, and batch normalization. The Bi-LSTM layer in each branch contains 32 hidden units to capture long-range dependencies after kernel-specific convolutional feature extraction. The output features from the three sequential branches are then flattened and concatenated, followed by two fully connected Dense layers with 128 and 64 units, respectively, before the final softmax output layer. This sequential design serves as a baseline to investigate the impact of learning temporal dependencies after convolutional feature extraction, in contrast to the parallel feature learning strategy adopted in DP-OTG.

By integrating deep learning with NLP-based feature extraction, DP-OTG establishes a new benchmark for glycosylation site prediction, outperforming existing tools while maintaining computational efficiency. The core components of DP-OTG, along with their parameters and configurations, are summarized in Table 2.

Table 2.

The Information of DP-OTG’s Components and Its Parameters

Component	Parameters and configuration details
Input	Sequence with a maximum length of max_sequence_length, vocabulary size vocab_size
Embedding layer	300-dimensional embedding space, trainable = True
Dropout	0.3
CNN Branch1 (Kernel = 3)	5 Conv1D layers (32 filters, kernel size 3, ReLU activation, same padding); MaxPooling1D (pool size = 2); Dropout (0.4); BatchNormalization
CNN Branch2 (Kernel = 5)	5 Conv1D layers (32 filters, kernel size 5, ReLU activation, same padding); MaxPooling1D (pool size = 2); Dropout (0.4); BatchNormalization
CNN Branch3 (Kernel = 7)	5 Conv1D layers (32 filters, kernel size 7, ReLU activation, same padding); MaxPooling1D (pool size = 2); Dropout (0.4); BatchNormalization
Bi-LSTM Branch	Bi-LSTM (32 units); Dropout (0.4); BatchNormalization
Merging	Flatten outputs from all CNN and Bi-LSTM branches; Concatenate layer to merge features
DenseLayer1	128 units, ReLU activation, L2 regularization (coefficient = 0.01)
Dropout(Dense1)	0.5
DenseLayer2	64 units, ReLU activation, L2 regularization (coefficient = 0.01)
Dropout(Dense2)	0.5
DenseLayer3 (Final Output)	2 units, Softmax activation

Optimization Parameters: Adam optimizer, learning rate = 0.001, 200 epochs, using early stopping, batch size = 16.

The operation and prediction mechanism of the proposed model is carried out in four steps: (1) create the 1-gram dictionary; (2) encode the sequences; (3) train and validate the predictive model; and (4) make predictions and obtain the predicted outputs. Detailed information is illustrated in Algorithm DP-OTG, as shown in Figure 3.

FIG. 3.

Algorithm DP-OTG: Accurate prediction of human OTG sites.

The overall architecture and operational mechanism of the proposed model confer significant advantages, enabling it to outperform existing prediction tools for human OTG site identification. The key advantages of the DP-OTG model are summarized as follows:

Automatic Feature Extraction: Unlike traditional approaches that depend on handcrafted features, DP-OTG automatically learns sequence patterns directly from raw input data.

Trainable Word Embeddings: The embedding layer is optimized during training, enabling the model to learn task-specific representations for human OTG site prediction.

Hybrid Learning Strategy: By combining 1D-CNN to capture local sequence motifs and Bi-LSTM to model long-range dependencies, the model achieves a more comprehensive and robust representation of protein sequences.

Computational Efficiency: In contrast to large-scale PLMs, the DP-OTG model provides a lightweight and scalable solution, making it well-suited for large-scale PTM prediction tasks.

To evaluate the performance of the predictive models, 10-fold cross-validation was performed. The following measures, which are commonly used to assess classification performance, were computed: Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), Matthews Correlation Coefficient (MCC), Recall, Precision, F1-score, and area under the curve (AUC).

$$\text{Sensitivity (SEN)} = \frac{TP}{TP + FN} \quad (1)$$

$$\text{Specificity (SPE)} = \frac{TN}{TN + FP} \quad (2)$$

$$\text{Accuracy (ACC)} = \frac{TP + TN}{TP + FN + TN + FP} \quad (3)$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (4)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (5)$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad (6)$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (7)$$

Herein, TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.
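Equations (1) to (7) can be checked with a short Python sketch. The function name and the zero-denominator convention are assumptions for illustration. The example counts are those implied by the sensitivity (86.0%) and specificity (91.5%) reported for DP-OTG in Section 3.3 on the balanced test set of 200 positive and 200 negative samples, i.e., TP = 172, FN = 28, TN = 183, FP = 17.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the measures of Equations (1)-(7) from confusion-matrix counts."""

    def ratio(num, den):
        # Convention chosen here: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    sen = ratio(tp, tp + fn)              # Eq. (1); identical to Recall, Eq. (5)
    spe = ratio(tn, tn + fp)              # Eq. (2)
    acc = ratio(tp + tn, tp + fn + tn + fp)  # Eq. (3)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))  # Eq. (4)
    precision = ratio(tp, tp + fp)        # Eq. (6)
    f1 = ratio(2 * precision * sen, precision + sen)  # Eq. (7)
    return {"SEN": sen, "SPE": spe, "ACC": acc, "MCC": mcc,
            "Precision": precision, "F1": f1}


m = classification_metrics(tp=172, tn=183, fp=17, fn=28)
print({name: round(value, 4) for name, value in m.items()})
# ACC = 0.8875, MCC ≈ 0.776, F1 ≈ 0.8843
```

These values agree with the accuracy (88.75%), MCC (0.776), and F1-score (88.43%) reported for the balanced test set, which confirms that the formulas above are the ones used.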

After running a 10-fold cross-validation process, the predictive model with the highest MCC and accuracy values is selected as the optimal model for identifying human OTG sites. Additionally, an independent testing approach was conducted to evaluate the selected model’s performance in a real-case scenario. Furthermore, a comparison between our proposed model and a recent relevant predictor of human OTG sites was performed to assess the practical effectiveness of the proposed model.
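The fold construction behind 10-fold cross-validation can be sketched as follows. The source does not state how the folds were formed, so class stratification, the round-robin assignment, the fixed seed, and the name `stratified_kfold` are all assumptions made for illustration.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Samples of each class are shuffled and dealt round-robin into the folds,
    so every fold keeps roughly the same class ratio as the full set.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    for f in range(k):
        validation = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, validation


# Training set of this study: 878 positive and 878 negative fragments.
labels = [1] * 878 + [0] * 878
for train_idx, val_idx in stratified_kfold(labels):
    pass  # train on train_idx, compute MCC/ACC on val_idx, keep the best model
```

Each sample appears in exactly one validation fold, so the ten validation scores together cover the whole training set once.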

3. RESULTS AND DISCUSSION

3.1. Dataset analysis

To examine the position-specific amino acid composition of human OTG sites, WebLogo (Crooks et al., 2004) was applied to generate a graphical sequence logo, visualizing the relative frequency of amino acids at positions surrounding glycosylation sites. The sequence entropy plots in WebLogo depicted the conserved motifs within the substrate sequences, allowing for the identification of amino acid preferences around glycosylation sites.

The comparative analysis of amino acid composition between human OTG sites (positive data) and non-OTG (negative data) revealed notable differences. As illustrated in Figure 4A, the most prominent amino acids in human OTG sites included Threonine (T), Serine (S), and Proline (P), while amino acids such as Tryptophan (W), Cysteine (C), and Methionine (M) were observed less frequently. Although Threonine (T), Serine (S), and Proline (P) appeared frequently in both positive and negative datasets, their occurrence was nearly twice as high in the positive dataset.

FIG. 4.

Frequency of the amino acid composition surrounding the human OTG sites: (A) The frequency of amino acids in the positive and negative training datasets; (B) Two-sample logo-a visualization of the differences between the positive and negative training datasets.

To further distinguish glycosylation from non-glycosylation sequences, a TwoSampleLogo (Vacic et al., 2006) analysis was performed, as shown in Figure 4B. The enriched residues in glycosylation sites predominantly included Threonine (T), Proline (P), and Serine (S), while the depleted residues were primarily nonpolar amino acids such as Leucine (L), Glycine (G), and Valine (V). These findings provide valuable insights into the sequence characteristics of human OTG sites and contribute to the development of predictive models for identifying human OTG sites.

3.2. Performance evaluation of cross-validation

As summarized in Table 3, Figure 5, and Figure 6, the 10-fold cross-validation results demonstrate the effectiveness of different 1D-CNN models and Bi-LSTM in human OTG site prediction. Various CNN architectures with different kernel sizes were evaluated based on key classification metrics, including Specificity (SPE), Sensitivity (SEN), Accuracy (ACC), MCC, and F1-score. To build an optimal predictive model, we conducted experiments on individual 1D-CNN models with different kernel sizes (3, 5, 7) as well as hybrid 1D-CNN architectures that combined these kernels. We identified the highest-performing hybrid 1D-CNN architecture and then integrated it with the Bi-LSTM network to capture a more comprehensive set of features from the data.

FIG. 5.

Comparison of MCC values across all predictive models for OTG site identification using 10-fold cross-validation. MCC, Matthews Correlation Coefficient.

FIG. 6.

The ROC curve of the DP-OTG model evaluated by 10-fold cross-validation.

Table 3.

Performance Evaluation by 10-Fold Cross-Validation

Model	SPE (%)	SEN (%)	ACC (%)	F1-score (%)	MCC
CNN3	85.31 ± 0.033	79.52 ± 0.058	82.41 ± 0.032	81.90 ± 0.046	0.649 ± 0.060
CNN5	85.99 ± 0.047	79.29 ± 0.048	82.64 ± 0.034	82.05 ± 0.041	0.654 ± 0.070
CNN7	85.76 ± 0.060	79.64 ± 0.046	82.70 ± 0.041	82.16 ± 0.047	0.655 ± 0.083
CNN35	85.54 ± 0.050	80.09 ± 0.058	82.81 ± 0.037	82.34 ± 0.045	0.657 ± 0.074
CNN57	83.03 ± 0.046	82.48 ± 0.037	82.85 ± 0.032	82.72 ± 0.038	0.655 ± 0.061
CNN37	85.19 ± 0.053	79.52 ± 0.043	82.36 ± 0.037	81.85 ± 0.037	0.648 ± 0.054
CNN357	87.59 ± 0.046	79.64 ± 0.058	83.61 ± 0.049	82.94 ± 0.042	0.676 ± 0.084
Bi-LSTM	83.71 ± 0.035	83.73 ± 0.042	83.72 ± 0.050	83.73 ± 0.042	0.674 ± 0.058
Sequential CNN357–Bi-LSTM	86.33 ± 0.04	84.76 ± 0.049	85.54 ± 0.052	85.44 ± 0.035	0.711 ± 0.054
Parallel CNN357–Bi-LSTM (DP-OTG)	86.10 ± 0.032	87.26 ± 0.058	86.68 ± 0.049	86.76 ± 0.039	0.734 ± 0.052

Among the CNN-based models, CNN357, a hybrid 1D-CNN architecture incorporating three different kernel sizes, demonstrated superior performance. Each kernel in CNN357 learns distinct motifs from the data, enhancing the model’s ability to extract relevant sequence patterns. CNN357 achieved the highest accuracy (83.6%) and MCC (0.676) among the CNN-only models, indicating its strong capability in feature extraction. The Bi-LSTM model, designed to capture long-range dependencies, exhibited balanced performance with an accuracy of 83.7% and an MCC of 0.674, further confirming its robustness in distinguishing glycosylation sites.

In addition to the above architectures, we further introduced a sequential baseline model to provide a more comprehensive comparison. In this baseline, three parallel branches were constructed, each consisting of a CNN layer with kernel sizes of 3, 5, and 7, followed by a Bi-LSTM layer (CNN3→Bi-LSTM, CNN5→Bi-LSTM, and CNN7→Bi-LSTM). The learned features from these three sequential branches were concatenated and subsequently passed through two fully connected dense layers to generate the final prediction. As reported in Table 3, this sequential CNN357–Bi-LSTM baseline achieved improved performance compared to individual CNN and Bi-LSTM models, with notable gains in accuracy (85.5%) and MCC (0.711), indicating that modeling long-range dependencies after kernel-specific convolutional feature extraction can enhance feature representation.

The parallel CNN357–Bi-LSTM (DP-OTG) model achieved the best overall performance, with the highest sensitivity (87.3%), accuracy (86.7%), F1-score (86.8%), MCC (0.734), and AUC (0.94). Its specificity (86.1%) was comparable to that of the other architectures, although CNN357 reached a higher value (87.6%). This performance highlights its ability to extract both local sequence motifs and long-range dependencies, which are essential for accurate glycosylation site prediction. Notably, the standard deviations of accuracy and F1-score across all models remained relatively low (mostly below 0.05), indicating consistent performance across folds. The proposed parallel CNN357–Bi-LSTM (DP-OTG) model also exhibited small standard deviations, further confirming its robustness and stability. These results suggest that integrating CNN and Bi-LSTM enhances feature representation, making DP-OTG the most effective model for glycosylation site prediction.

Comparatively, the superior performance of the parallel DP-OTG model over the sequential baseline suggests that allowing CNN and Bi-LSTM components to learn complementary representations simultaneously is more effective than enforcing a strictly sequential dependency. This parallel design facilitates richer feature interactions between local sequence motifs and long-range contextual information, leading to more discriminative and stable predictions across cross-validation folds.

The relatively low standard deviations across all metrics indicate that the model’s performance is stable and not significantly affected by variations in the training and testing partitions. These findings confirm the generalization ability of the model in distinguishing between positive and negative samples in the context of glycosylation site prediction.

To further evaluate the classification capability of our model, we employed t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the feature representations of the training dataset before and after training (Fig. 7). Prior to training, the data points corresponding to positive and negative samples were largely intermixed, indicating limited separability in the raw feature space. However, after training, the t-SNE projection revealed two clearly separated clusters, corresponding to the positive and negative classes. This distinct separation indicates that the model learned discriminative features, consistent with its strong classification performance. The results support the model’s effectiveness in capturing meaningful representations that enhance glycosylation site prediction.

FIG. 7.

t-SNE visualization of feature representations before and after training. (A) Before training: positive and negative samples are not well separated in the raw feature space. (B) After training: the model learns discriminative features, resulting in clear separation between the two classes, indicating effective classification capability. t-SNE, t-distributed Stochastic Neighbor Embedding.

3.3. Performance evaluation by independent testing approach

To validate the effectiveness of our proposed model on unseen data, we conducted independent testing on two distinct test sets: a balanced test set and an imbalanced test set. As summarized in Table 4 and Figures 8 and 9, DP-OTG outperformed the other architectures, reaching the highest MCC and AUC values on both the balanced and imbalanced test sets.

FIG. 8.

Comparison of model performance based on MCC scores on two independent test sets: one balanced and one imbalanced.

FIG. 9.

The ROC curves of predictive models on the Balanced test set and Imbalanced test set.

Table 4.

Performance Evaluation by Independent Testing (Imbalanced and Balanced Test)

Model	Imbalanced SPE (%)	Imbalanced SEN (%)	Imbalanced ACC (%)	Imbalanced F1-score (%)	Balanced SPE (%)	Balanced SEN (%)	Balanced ACC (%)	Balanced F1-score (%)
CNN3	85.97	75.82	84.40	60.13	85.50	78.00	81.75	81.04
CNN5	82.85	84.07	83.03	60.59	85.00	79.00	82.00	81.44
CNN7	86.48	77.47	85.08	61.71	85.00	79.50	82.25	81.75
CNN35	81.03	80.22	80.90	56.59	79.00	82.00	80.50	80.79
CNN57	85.17	79.12	84.23	60.89	87.00	81.00	84.00	83.51
CNN37	85.27	77.47	84.06	60.13	86.00	79.50	82.75	82.17
CNN357	85.17	81.87	84.65	62.34	86.00	83.50	84.75	84.56
Bi-LSTM	82.95	84.07	83.12	60.71	82.50	85.50	84.00	84.24
Sequential CNN357–Bi-LSTM	86.68	85.71	86.53	66.40	87.00	87.00	87.00	87.00
Parallel CNN357–Bi-LSTM (DP-OTG)	90.21	84.62	89.34	71.13	91.50	86.00	88.75	88.43

On the balanced test set, DP-OTG achieved the highest AUC (0.941), accuracy (88.8%), and MCC (0.776). It also attained the highest specificity (91.5%), while its sensitivity (86.0%) was second only to that of the sequential CNN357–Bi-LSTM baseline (87.0%; Table 4). These results indicate that the model effectively distinguishes glycosylation sites while maintaining a strong trade-off between sensitivity and specificity. Similarly, on the imbalanced test set, DP-OTG achieved the highest accuracy (89.3%), MCC (0.661), specificity (90.2%), and F1-score (71.1%), with a sensitivity (84.6%) again second only to the sequential baseline (85.7%). This supports its ability to generalize to realistic data distributions, in which glycosylation sites are inherently the minority class.
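As a consistency check on Table 4, the following minimal sketch recomputes the reported metrics for DP-OTG from confusion-matrix counts. The counts are an assumption: they are back-calculated from the reported percentages (200 positives and 200 negatives for the balanced set; 182 positives and 991 negatives for the imbalanced set) and are not stated in this section. The function name `metrics` is illustrative.

```python
from math import sqrt

def metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy, precision, F1, and MCC."""
    sen = tp / (tp + fn)                      # recall on the positive class
    spe = tn / (tn + fp)                      # recall on the negative class
    acc = (tp + tn) / (tp + fn + tn + fp)
    pre = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SEN": sen, "SPE": spe, "ACC": acc, "PRE": pre, "F1": f1, "MCC": mcc}

# Counts inferred from the reported percentages (assumption, not given in the text).
balanced = metrics(tp=172, fn=28, tn=183, fp=17)
imbalanced = metrics(tp=154, fn=28, tn=894, fp=97)

for name, m in (("balanced", balanced), ("imbalanced", imbalanced)):
    row = ", ".join(f"{k}={v:.4f}" for k, v in m.items())
    print(f"{name}: {row}")
```

Under these inferred counts, sensitivity and specificity are nearly identical on the two sets, but precision falls from roughly 0.91 to roughly 0.61 because negatives outnumber positives about five to one. This is why the F1-score and MCC are lower on the imbalanced set even though accuracy is not.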

Beyond overall accuracy and MCC, recall is a particularly critical metric for imbalanced datasets, as it directly reflects the model’s ability to correctly identify true glycosylation sites. To provide a more comprehensive evaluation, Precision–Recall (PR) curves were plotted for all models on both the balanced and imbalanced test sets, as illustrated in Figure 10. As shown by the PR curves on the imbalanced test set (Fig. 10), DP-OTG maintains higher recall than the baseline models across a wide range of decision thresholds. This indicates that DP-OTG is more effective at reducing false negatives, which is essential for practical glycosylation site prediction tasks. DP-OTG also achieved the highest average precision, reaching 0.942 on the balanced set and 0.806 on the imbalanced set, and maintained higher precision across a wide range of recall values, particularly under imbalanced conditions. These results further support the discriminative capability and generalization performance of the proposed model.

FIG. 10.

Comparison of Precision–Recall curves among all models on the balanced and imbalanced test sets, highlighting the superior performance of the proposed parallel CNN357_Bi-LSTM (DP-OTG) model.
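For readers unfamiliar with average precision, the sketch below shows one common way to compute it: the mean of the precision values at the rank of each true positive, which equals the step-wise area under the PR curve when no scores are tied. The labels and scores are toy values, not model outputs from this study, and the function name `average_precision` is illustrative.

```python
def average_precision(labels, scores):
    """Average precision, assuming binary labels (1/0) and no tied scores."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank samples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos

# Toy example (hypothetical scores): three positives, two negatives.
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.90, 0.80, 0.70, 0.60, 0.20]
ap_small = average_precision(toy_labels, toy_scores)

# Same positives with three extra negatives scored among them.
more_labels = toy_labels + [0, 0, 0]
more_scores = toy_scores + [0.85, 0.75, 0.65]
ap_more_negatives = average_precision(more_labels, more_scores)

print(f"AP = {ap_small:.4f}; with extra negatives = {ap_more_negatives:.4f}")
```

Adding negatives that score among the positives lowers average precision even though the positives keep the same scores. This is one reason the values on an imbalanced set are expected to be lower than on a balanced one.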

The strong recall and PR performance of DP-OTG on the imbalanced test set can be attributed to its parallel CNN–Bi-LSTM architecture, which effectively integrates multi-scale local feature extraction with long-range dependency modeling. Compared with individual CNN variants and the sequential CNN–Bi-LSTM baseline, DP-OTG achieves a more favorable recall–precision trade-off, enabling the detection of a larger proportion of true glycosylation sites without a substantial increase in false positive predictions.

Overall, these findings confirm that the DP-OTG model excels at capturing essential sequence features, making it a highly reliable approach for glycosylation site prediction across various data distributions.

To evaluate the generalization capability of our model on unseen data, we applied t-SNE to visualize the feature representations of two independent test sets: one with a balanced distribution of positive (label 1) and negative (label 0) samples, and another with an imbalanced distribution. As illustrated in Figure 11(A) and Figure 12(A), in the raw feature space, the positive and negative samples are largely intermixed, indicating limited inherent separability prior to modeling. However, after passing the test data through the trained model, the transformed feature spaces show clearly separated clusters corresponding to the two classes, as displayed in Figure 11(B) and Figure 12(B). This distinct class separation demonstrates the model’s ability to extract discriminative and generalizable representations, enabling accurate classification of glycosylation sites even in scenarios involving class imbalance.

FIG. 11.

t-SNE visualization of balanced test data. (A) Raw feature space before applying the model. (B) Transformed feature space after passing through the trained model, showing improved class separability.

FIG. 12.

t-SNE visualization of imbalanced test data. (A) Raw feature space before applying the model. (B) Transformed feature space after passing through the trained model, showing improved class separability.

3.4. Performance comparison with existing predictors

To evaluate the effectiveness of our proposed DP-OTG model, we compared its performance against recent relevant prediction tools for human OTG using the same training dataset and an independent test dataset. The results are summarized in Table 5 and Figure 13, where both balanced and imbalanced test sets were used to assess each model’s robustness.

FIG. 13.

Comparison of DP-OTG and Existing Tools Based on MCC across Balanced and Imbalanced Tests.

Table 5.

Comparison of the Performance of the Proposed Model with Other Tools Using the Same Test Dataset

Model	Balanced SPE (%)	Balanced SEN (%)	Balanced ACC (%)	Imbalanced SPE (%)	Imbalanced SEN (%)	Imbalanced ACC (%)
GlycoMine (Li et al., 2015)	53.00	50.50	51.80	3.30	100.00	19.40
GlycoEP (Chauhan et al., 2013)	41.50	65.50	53.50	35.10	65.50	40.20
OGP (Huang et al., 2021)	68.50	84.00	76.30	68.60	84.00	71.20
NetOGlyc-4.0 (Steentoft et al., 2013)	84.50	79.50	82.00	83.20	79.00	82.50
O-GlyThr (Tang et al., 2023)	88.50	81.00	84.80	12.70	20.90	14.00
HOTGpred (Pham et al., 2024)	92.00	84.50	88.30	88.40	84.50	87.70
DOGpred (Lee et al., 2025)	91.50	85.50	88.50	89.60	84.10	88.80
DP-OTG	91.50	86.00	88.80	90.20	84.60	89.30

On the balanced test set, DP-OTG achieves an ACC of 88.8%, an MCC of 0.776, and an AUC of 0.941, outperforming most existing models. While HOTGpred and DOGpred obtain slightly higher AUC values (0.943 and 0.947, respectively), DP-OTG shows the smallest gap between specificity and sensitivity among the three (91.5% vs. 86.0%; Table 5). Importantly, our model does not rely on handcrafted feature engineering or computationally expensive pretrained protein language models (PLMs), making it a more efficient alternative.

For the imbalanced test set, DP-OTG demonstrates superior performance, achieving an ACC of 89.3%, an MCC of 0.661, and an AUC of 0.932. This highlights its robustness under real-world conditions, where class distributions are often skewed. In contrast, models such as O-GlyThr and GlycoEP experience substantial drops in performance, with negative MCC values and reduced predictive stability. Even state-of-the-art predictors like HOTGpred and DOGpred show a decline in MCC, whereas DP-OTG maintains consistent performance.

Overall, these results confirm that DP-OTG provides a strong balance between accuracy, computational efficiency, and generalizability. Unlike existing tools that either suffer from performance degradation on imbalanced data or require extensive feature extraction, DP-OTG delivers competitive performance with a streamlined architecture. This makes it a promising choice for large-scale applications requiring reliable PTM site prediction.

4. CONCLUSION

In this study, we proposed DP-OTG, a feature-free deep learning model developed for the accurate prediction of human OTG sites. This innovative end-to-end architecture seamlessly integrates deep learning with NLP techniques, removing the reliance on handcrafted features. The model combines a five-layer 1D-CNN operating in parallel with three distinct kernel sizes and a Bi-LSTM network, enabling each component to autonomously learn a wide range of sequence features directly from raw input data. This design is further enhanced by an NLP-based word embedding technique, implemented in the Embedding layer with trainable = True, enabling the model to dynamically update the word embedding vectors during training through the backpropagation mechanism. This ensures that the embeddings are optimized specifically for the glycosylation prediction task, rather than relying on static, pretrained representations.
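The multi-kernel convolutional part of this design can be sketched in plain Python as follows. This is a toy forward pass under stated assumptions, not the authors' implementation: the 21-residue window, the embedding size, the number of filters, and the random weights are all hypothetical; the kernel sizes 3, 5, and 7 follow the CNN357 naming used in Table 4; and the Bi-LSTM branch, the five-layer depth, and training by backpropagation are omitted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding or unknown residues (an assumption).
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode(window):
    """Map each residue to an integer token, as an embedding layer expects."""
    return [TOKEN.get(aa, 0) for aa in window]

def conv_branch(embedded, filters):
    """Valid 1D convolution, ReLU, then global max pooling (one value per filter)."""
    pooled = []
    for weights in filters:                 # weights has shape (kernel, dim)
        kernel = len(weights)
        best = 0.0                          # ReLU floor: max(0, activation)
        for start in range(len(embedded) - kernel + 1):
            activation = sum(
                weights[i][d] * embedded[start + i][d]
                for i in range(kernel)
                for d in range(len(weights[i]))
            )
            best = max(best, activation)
        pooled.append(best)
    return pooled

rng = random.Random(0)
DIM, N_FILTERS, KERNELS = 8, 4, (3, 5, 7)   # hypothetical sizes

# Embedding table: one vector per token; in a real model these are trained.
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)]
             for _ in range(len(AMINO_ACIDS) + 1)]
# One independent set of filters per kernel size (the parallel branches).
branches = {
    k: [[[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(k)]
        for _ in range(N_FILTERS)]
    for k in KERNELS
}

# Hypothetical window with the candidate threonine at the centre.
window = "PAPSTPTTSA" + "T" + "PSGTTPAPSA"
tokens = encode(window)
embedded = [embedding[t] for t in tokens]

# Each branch sees the same embedded input; their outputs are concatenated.
features = [v for k in KERNELS for v in conv_branch(embedded, branches[k])]
print(len(tokens), len(features))
```

Because every branch reads the same embedded sequence and the pooled outputs are concatenated, motifs of different lengths contribute side by side to the final feature vector, which is the property the parallel design is meant to provide.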

Extensive cross-validation and independent testing confirm that DP-OTG surpasses existing predictors across both balanced and imbalanced datasets. As shown in Table 5, our model achieves the highest accuracy among all compared methods, with an ACC of 88.8% and an MCC of 0.776 on the balanced test set, and an ACC of 89.3% and an MCC of 0.661 on the imbalanced test set. These results demonstrate that DP-OTG not only improves predictive accuracy but also maintains strong performance under imbalanced conditions, where many existing tools experience performance degradation.

Our findings highlight the effectiveness of hybrid deep learning combined with NLP techniques in PTM site prediction, eliminating manual feature selection while achieving superior accuracy. The ability to update word embedding vectors during training further enhances feature extraction, enabling the model to learn task-specific sequence representations dynamically.

AUTHORS’ CONTRIBUTIONS

T.-X.T.: Writing – original draft, methodology, investigation, formal analysis, data curation, and conceptualization. N.-Q.-K.L.: Methodology, writing – original draft, validation, supervision, and investigation. D.-H.L.: Methodology, writing – review and editing, and conceptualization. V.-N.N.: Writing – original draft, writing – review and editing, supervision, methodology, funding acquisition, data curation, and conceptualization.

ACKNOWLEDGMENT

The authors sincerely thank the anonymous reviewers for their comments and suggestions, which greatly helped to improve the article.

AUTHOR DISCLOSURE STATEMENT

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

FUNDING INFORMATION

This work was funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2023.49.

ETHICAL APPROVAL

This study does not involve human participants or animals and was therefore exempt from Institutional Review Board (IRB) approval.

References

1. Arey. The role of glycosylation in receptor signaling. Glycosylation, 2012; 26:50262.

2. Chauhan, Rao, Raghava. In silico platform for prediction of N-, O- and C-glycosites in eukaryotic protein sequences. PLoS One, 2013; 8(6):e67008.

3. Crooks, Hon, Chandonia J-M, et al. WebLogo: A sequence logo generator. Genome Res, 2004; 14(6):1188–1190.

4. Haltiwanger, Lowe. Role of glycosylation in development. Annu Rev Biochem, 2004; 73:491–537.

5. Hart. Glycosylation. Curr Opin Cell Biol, 1992; 4(6):1017–1023.

6. Huang, Wu, Zhang, et al. OGP: A repository of experimentally characterized O-glycoproteins to facilitate studies on O-glycosylation. Genomics Proteomics Bioinformatics, 2021; 19(4):611–618.

7. Jayaprakash, Surolia. Role of glycosylation in nucleating protein folding and stability. Biochem J, 2017; 474(14):2333–2347.

8. Kao, Nguyen, Huang, et al. SuccSite: Incorporating amino acid composition and informative k-spaced amino acid pairs to identify protein succinylation sites. Genomics Proteomics Bioinformatics, 2020; 18(2):208–219.

9. , C, Mj, et al. Comparative risk estimates of an expanded list of PAHs from community and source-influenced air sampling. Chemosphere, 2020; 253:126680.

10. Lee, Pham, Min, et al. DOGpred: A novel deep learning framework for accurate identification of human O-linked threonine glycosylation sites. J Mol Biol, 2025; 437(6):168977.

11. Li, Li, Wang, et al. GlycoMine: A machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics, 2015; 31(9):1411–1419.

12. Lis, Sharon. Protein glycosylation: Structural and functional aspects. Eur J Biochem, 1993; 218(1):1–27.

13. Magalhães, Duarte, Reis. The role of O-glycosylation in human disease. Mol Aspects Med, 2021; 79:100964.

14. Nguyen, Huang, et al. A new scheme to characterize and identify protein ubiquitination sites. IEEE/ACM Trans Comput Biol Bioinform, 2017; 14(2):393–403.

15. Nguyen, Tran, Nguyen, et al. Enhancing Arabidopsis thaliana ubiquitination site prediction through knowledge distillation and natural language processing. Methods, 2024; 232:65–71.

16. Ofer, Brandes, Linial. The language of proteins: NLP, machine learning & protein sequences. Comput Struct Biotechnol J, 2021; 19:1750–1758.

17. Oliveira-Ferrer, Legler, Milde-Langosch. Role of protein glycosylation in cancer metastasis. Semin Cancer Biol, 2017; 44:141–152.

18. Pakhrin, Chauhan, Khan, et al. Prediction of human O-linked glycosylation sites using stacked generalization and embeddings from pre-trained protein language model. Bioinformatics, 2024; 40(11):btae643.

19. Pham, Zhang, Rakkiyappan, et al. HOTGpred: Enhancing human O-linked threonine glycosylation prediction using integrated pretrained protein language model-based features and multi-stage feature selection approach. Comput Biol Med, 2024; 179:108859.

20. Reily, Stewart, Renfrow, et al. Glycosylation in health and disease. Nat Rev Nephrol, 2019; 15(6):346–366.

21. Schedin-Weiss, Winblad, Tjernberg. The role of protein glycosylation in Alzheimer disease. FEBS J, 2014; 281(1):46–62.

22. Steentoft, Vakhrushev, Joshi, et al. Precision mapping of the human O-GalNAc glycoproteome through SimpleCell technology. EMBO J, 2013; 32(10):1478–1488.

23. Tang, Tang, Zhang, et al. O-GlyThr: Prediction of human O-linked threonine glycosites using multi-feature fusion. Int J Biol Macromol, 2023; 242(Pt 2):124761.

24. Tran, Khanh Le, Nguyen. Integrating CNN and Bi-LSTM for protein succinylation sites prediction based on natural language processing technique. Comput Biol Med, 2025a; 186:109664.

25. Tran, Nguyen, Le, et al. KD_MultiSucc: Incorporating multi-teacher knowledge distillation and word embeddings for cross-species prediction of protein succinylation sites. Biol Methods Protoc, 2025b; 10(1):bpaf041.

26. Tran, Nguyen, Khanh Le. Incorporating natural language-based and sequence-based features to predict protein SUMOylation sites. In: The 12th Conference on Information Technology and Its Applications. CITA 2023. Lecture Notes in Networks and Systems, Vol. 734. Springer: Cham; 2023; pp. 74–88.

27. UniProt Consortium. UniProt: The universal protein knowledgebase in 2025. Nucleic Acids Res, 2024; 53:D609–D617.

28. Vacic, Iakoucheva, Radivojac. Two sample logo: A graphical representation of the differences between two sets of sequence alignments. Bioinformatics, 2006; 22(12):1536–1537.