Toward Robust Self-Training Paradigm for Molecular Prediction Tasks

Abstract

Molecular prediction tasks normally demand a series of professional experiments to label the target molecule, so labeled data are scarce. One of the semisupervised learning paradigms, known as self-training, utilizes both labeled and unlabeled data. Specifically, a teacher model is trained using labeled data and produces pseudo labels for unlabeled data. These labeled and pseudo-labeled data are then jointly used to train a student model. However, the pseudo labels generated by the teacher model are generally not sufficiently accurate. Thus, we propose a robust self-training strategy that explores robust loss functions to handle such noisy labels in two paradigms, that is, generic and adaptive. We have conducted experiments on three molecular biology prediction tasks with four backbone models to gradually evaluate the performance of the proposed robust self-training strategy. The results demonstrate that the proposed method enhances prediction performance across all tasks, notably within molecular regression tasks, where the average improvement is 41.5%. Furthermore, the visualization analysis confirms the superiority of our method. Our proposed robust self-training is a simple yet effective strategy that efficiently improves molecular biology prediction performance. It tackles the shortage of labeled data in molecular biology by taking advantage of both labeled and unlabeled data. Moreover, it can be easily embedded in any prediction task, which makes it a universal approach for the bioinformatics community.

1. INTRODUCTION

Molecular prediction tasks are essential and fundamental within bioinformatics fields such as drug discovery (Ching et al., 2018; Guo et al., 2020a; Paul et al., 2010), which contain various molecule-relevant tasks, including molecular property prediction and protein secondary or tertiary structure prediction. With the advance of deep learning techniques, more and more studies address these tasks with various deep learning models (Guo et al., 2022; Heffernan et al., 2015; Ma et al., 2022a; Wang et al., 2016; Wu et al., 2018; Xu et al., 2018; Yang et al., 2019). These prediction tasks are well known as supervised problems, where labeled data serve as input and computational models are employed to predict the corresponding labels.

Many existing methods target such problems in this manner (Duvenaud et al., 2015; Gilmer et al., 2017; Guo et al., 2021; Ma et al., 2021; Ma et al., 2020a; Sønderby and Winther, 2014; Wu et al., 2018). However, one major challenge in molecular biology is that labeled data are limited and also difficult to obtain. The process usually involves a series of professional experiments, which can be both time-consuming and costly. Therefore, more paradigms have been developed to take advantage of unlabeled data to assist supervised learning, such as semisupervised learning (Hu et al., 2019; Mann and McCallum, 2007; Rong et al., 2020; Wang et al., 2019a; Zhang et al., 2018; Zhao et al., 2014). Within this field, a simple yet effective paradigm that leverages both unlabeled and labeled data, known as self-training, is rarely explored for molecular biology prediction tasks.

Self-training is generally established in four steps: (1) A teacher model is trained using labeled data; (2) the trained teacher model is subsequently utilized to generate pseudo labels for unlabeled data; (3) the labeled data and the pseudo-labeled data are combined to train a student model; and (4) the student model then becomes the teacher model to repeat steps 2–3 until the training converges. This approach allows more data to be included in the training process and enables the student model to inherit from the teacher. The paradigm is easy to implement and effective in boosting the training process. Self-training has been widely used in other areas with promising performance, for example, Computer Vision (CV) (Babakhin et al., 2019; Xie et al., 2020; Zoph et al., 2020) and Natural Language Processing (NLP) (Cheng, 2019; He et al., 2019; Li et al., 2019).

Notably, it has achieved significant success in image classification. Xie et al. (2020) propose a Noisy Student model, which constructs a larger student model and adds noise to force the student to learn more complex information. Touvron et al. (2021) explore the attention mechanism and develop a transformer-based model that follows the teacher-student flow. Meta-Pseudo Labels (Pham et al., 2021) dynamically updates the teacher model based on the student performance for more effective learning from limited labeled data.

A key reason for the success is that not only are the unlabeled data enormous, but the labeled data and the training models are also quite large. Thus, the teacher model can sufficiently learn from the labeled data and achieve favorable performance, from which the student can then learn further. Take image classification as an example: the teacher model, with hundreds of millions of parameters, is trained on the ImageNet dataset, which contains over 14 million images. The prediction performance can exceed 85% top-1 accuracy and 95% top-5 accuracy in a 1000-class prediction problem (Xie et al., 2020).

However, for most molecular biology prediction tasks, the size of the labeled dataset typically amounts to only a few thousand, with the corresponding prediction performance not as high as in image classification. Such scenarios lead to a problem: The generated pseudo labels may not be accurate. Such noisy labels may further bias student learning. Therefore, addressing label noise is the major concern when establishing self-training in the field of molecular biology.

One straightforward way to encourage the model to learn from the noisy labels is to design a loss with regularization to leverage the neural network learning. Mean absolute error (MAE) and cross-entropy (CE) loss are two commonly used loss functions in prediction tasks. MAE is typically employed in regression tasks and CE for classification tasks. While MAE has been theoretically proven to be robust to label noise during the training, the CE is not (Ghosh et al., 2017). Recently, robust loss functions have been studied to tackle the noisy label problem in classification tasks by generalizing MAE and CE, and have achieved impressive performance in image classification tasks (Englesson and Azizpour, 2021; Ghosh et al., 2017; Ma et al., 2020b; Wang et al., 2019b; Zhang and Sabuncu, 2018).

In this article, we leverage robust loss function and self-training to form a robust self-training framework for molecular biology prediction tasks in two paradigms, that is, generic and adaptive. Extensive experiments have been conducted over molecular regression and classification tasks, as well as protein secondary structure prediction (PSSP) tasks, to gradually evaluate the effectiveness of the proposed method.

Our contributions can be summarized as follows: (1) We propose a robust self-training paradigm that utilizes robust loss to constrain the student training; (2) the proposed framework is straightforward and easy to fit into any prediction task, which is a simple yet practical strategy to promote molecular prediction tasks; (3) the adaptive paradigm further specifies and enhances the training on unlabeled data; and (4) extensive experiments on various molecular prediction tasks demonstrate that self-training can improve the prediction performance by involving more unlabeled data, and the robust loss can further boost the performance by leveraging the label noise, as well as alleviating the overfitting during the training.

2. MATERIALS AND METHODS

2.1. Problem definition

Molecular biology prediction problems can be categorized as regression problems or classification problems. Given a molecule $ℳ$ , the label that needs to be predicted is denoted as y, where $y \in R$ for a regression problem and $y \in \{0, 1, \dots, K - 1\}$ for a K-class classification problem. The input molecule $ℳ$ can take any format according to the task specifics, for example, a protein sequence for PSSP, or a molecular graph structure for molecular property prediction. In this study, we conduct three types of experiments to gradually demonstrate the effectiveness of the proposed robust self-training strategy: molecular property regression, molecular property classification, and PSSP. For the former two tasks, we employ the corresponding representative graph-based models, Equivariant Graph Neural Network (EGNN) (Satorras et al., 2021) and Graph Isomorphism Network (GIN) (Xu et al., 2018), as the backbone models to conduct extensive experiments.

Graph-based models generally take the molecular graph structure as the input, and generate a continuous vector $h_{g} \in R^{d_{g}}$ by learning from the graph structure and features, where $d_{g}$ is the dimension of graph-based features. Then $h_{g}$ is used to predict the value of y for molecule $ℳ$ . For the protein three-state secondary structure prediction, AWD-GRU (Cho et al., 2014) and EnsembleASP (Guo et al., 2020b) are used as the backbone models. The input protein sequence is fed into the models, which generate a feature vector $h_{s} \in R^{d_{s}}$ , where $d_{s}$ is the dimension of sequence-based features. Next, $h_{s}$ is utilized to predict the category of y for $ℳ$ , where $y \in \{0, 1, 2\}$ for protein three-state secondary structure.

2.2. Robust self-training overview

Our proposed robust self-training strategy is implemented on top of the self-training framework. Figure 1 illustrates the overall architecture, which can be viewed as two parts, train teacher and train student. First of all, a teacher model is trained on the labeled dataset $D_{l}$ , and a trained teacher model T is obtained.

FIG. 1.

An overview illustration of the robust self-training architecture. More details are described in Section 2.2.

After that, the student training process begins. It starts with generating the pseudo labels for the unlabeled dataset $D_{u}$ to construct a pseudo-labeled dataset $D_{p}$ . Then the student model is initialized with the teacher model, and trained on shuffled $D_{l}$ + $D_{p}$ . Training for several epochs is considered one iteration. The best model during the i-th iteration is selected as the best student model $S_{i}$ , which is then regarded as the new teacher model to repeat the previous steps. This process is repeated until the student model converges. The whole training process is detailed in Algorithm 1.
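The teacher-student loop described above can be sketched on a toy problem. The following is a minimal illustration, not the authors' Algorithm 1: it assumes a linear regression model trained by full-batch subgradient descent on MAE, and it simplifies model selection by keeping the student obtained at the end of each iteration rather than the best epoch within it. The helper names `fit`, `mae`, and `self_train` are hypothetical.

```python
import numpy as np

def fit(w_init, X, y, epochs=300, lr=0.05):
    """Full-batch subgradient descent on MAE for a linear model X @ w."""
    w = w_init.copy()
    for _ in range(epochs):
        w -= lr * X.T @ np.sign(X @ w - y) / len(y)
    return w

def mae(w, X, y):
    """Mean absolute error of the linear model on (X, y)."""
    return float(np.mean(np.abs(X @ w - y)))

def self_train(X_l, y_l, X_u, X_val, y_val, iterations=3):
    """Toy self-training: teacher on D_l, then repeated student training."""
    teacher = fit(np.zeros(X_l.shape[1]), X_l, y_l)      # teacher trained on D_l
    best_w, best_val = teacher, mae(teacher, X_val, y_val)
    for _ in range(iterations):
        y_p = X_u @ teacher                              # pseudo labels for D_u
        X_mix = np.vstack([X_l, X_u])                    # D_l + D_p
        y_mix = np.concatenate([y_l, y_p])
        student = fit(teacher, X_mix, y_mix)             # student starts from teacher
        val = mae(student, X_val, y_val)
        if val < best_val:                               # keep best validation model
            best_w, best_val = student, val
        teacher = student                                # student becomes new teacher
    return best_w, best_val
```

Shuffling is omitted here because the toy optimizer is full-batch; with mini-batch training the combined dataset would be shuffled as described in the text.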

Moreover, we design an adaptive robust self-training paradigm to further enhance the student training, which is illustrated as the adaptive option in Figure 1 and Algorithm 1. Specifically, robust loss $ℒ_{r}$ is employed only for the unlabeled dataset $D_{u}$ , whose labels are generated pseudo labels, while the original loss $ℒ_{o}$ is calculated on the labeled dataset $D_{l}$ . In this manner, $ℒ_{o}$ on the labeled dataset $D_{l}$ acts as a restriction by adhering to the ground-truth information, while $ℒ_{r}$ on $D_{u}$ solely handles the noisy pseudo labels. For classification tasks, this means the original CE loss is retained for the labeled training data.
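A minimal sketch of the adaptive option follows. It assumes per-sample losses are simply averaged over a mixed batch, which the text does not specify, and it uses the loss $1 - p$ on the labeled-class probability as a stand-in robust loss. The names `adaptive_loss`, `ce_per_sample`, and `robust_per_sample` are hypothetical.

```python
import numpy as np

def ce_per_sample(probs, labels):
    """Original loss L_o: cross-entropy on the probability of the given label."""
    return -np.log(probs[np.arange(len(labels)), labels])

def robust_per_sample(probs, labels):
    """Stand-in robust loss L_r: 1 - p on the given label (an MAE-type loss)."""
    return 1.0 - probs[np.arange(len(labels)), labels]

def adaptive_loss(probs, labels, is_pseudo,
                  original=ce_per_sample, robust=robust_per_sample):
    """Apply L_r to pseudo-labeled samples and L_o to labeled samples."""
    per_sample = np.where(is_pseudo, robust(probs, labels), original(probs, labels))
    return float(per_sample.mean())

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([0, 1, 1])
is_pseudo = np.array([False, False, True])   # last sample carries a pseudo label
loss = adaptive_loss(probs, labels, is_pseudo)
```

Passing the two losses as arguments keeps the sketch independent of the particular robust loss chosen for a task.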

2.3. Molecular property prediction tasks

From a deep learning perspective, a molecular property prediction task can be divided into two parts: a molecular encoder model and a prediction model. The molecular encoder model generates a vector that represents the input molecule, and the prediction model takes the vector to make a prediction. In our experiments, the backbone models of the regression and classification tasks are graph based. We give a universal definition here for the graph-based encoder and the prediction model.

Molecule $ℳ$ can be naturally represented as a graph $G = (V, ℰ)$ , where $V$ is the set of p atoms ( $| V | = p$ ) and $ℰ$ is the set of q bonds ( $| ℰ | = q$ ) in the molecule. The features of atom v are denoted as $a_{v} \in R^{d_{a}}$ , and the features of bond $(v, u)$ are denoted as $b_{v u} \in R^{d_{b}}$ , where $d_{a}$ and $d_{b}$ are the feature dimensions of atoms and bonds, respectively. $N (v)$ represents the neighbor atoms of atom v, which are identified by the connected bonds.

GNN-based models generally perform a message passing and state update protocol. Specifically, the state of the target atom v at step d ( $d = 0, 1, \dots, D$ ) is updated by aggregating the states $h_{u}$ of its neighbors ( $u \in N (v)$ ) along with its own state $h_{v}$ , where D is the number of steps, so that information from up to D-hop neighbors of v is captured. After D steps, the states of all the atoms are combined to generate a vector representation $h_{G}$ through a readout mechanism. The process can be formulated as follows:

$h_{N (v)}^{d + 1} = \mathrm{AGGREGATE} (\{h_{u}^{d}, \forall u \in N (v)\}),$ (1)

$h_{v}^{d + 1} = σ (W^{d + 1} \cdot \mathrm{CONCAT} (h_{v}^{d}, h_{N (v)}^{d})),$ (2)

$h_{G} = \mathrm{READOUT} (\{h_{v}^{D + 1} \mid v \in V\}),$ (3)

where $W^{d + 1}$ are the learnable weights, and $σ$ is the activation function. The readout operation can be summation or mean.
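The message passing and readout protocol can be illustrated with a short NumPy sketch. It is written in the spirit of Equations (1)–(3) under several assumptions that are not taken from the paper: mean aggregation, ReLU as the activation $σ$, summation as the readout, and the aggregated neighbor state used directly in the same update. The function names are hypothetical.

```python
import numpy as np

def message_passing_step(h, neighbors, W):
    """Update every atom state from its own state and its neighbors' mean state."""
    h_new = np.zeros((h.shape[0], W.shape[0]))
    for v, nbrs in enumerate(neighbors):
        # AGGREGATE: mean of neighbor states (zeros for an isolated atom)
        h_nv = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        # CONCAT own state with the aggregate, apply weights, then ReLU
        h_new[v] = np.maximum(0.0, W @ np.concatenate([h[v], h_nv]))
    return h_new

def readout(h):
    """READOUT: sum the atom states into one graph-level vector."""
    return h.sum(axis=0)

# Three atoms in a chain 0-1-2, each with a 2-dimensional state.
h0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = [[1], [0, 2], [1]]
W = np.ones((2, 4))                      # maps the 4-dim concatenation to 2 dims
h1 = message_passing_step(h0, neighbors, W)
h_G = readout(h1)
```

Stacking several such steps with separate weight matrices would give the multi-hop receptive field described in the text.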

After going through the graph encoder model, the graph representation $h_{G}$ is then fed into the prediction model to make a prediction of the property. The prediction model is generally a simple neural network such as multilayer perceptron (MLP): $ŷ = \mathrm{MLP} (h_{G}),$ (4)

where $ŷ$ is the output of the prediction model, which refers to the predicted probability for the classification tasks or the actual predicted property value for the regression tasks. Next, each backbone model is introduced along with the employed robust loss, respectively.

2.3.1. Regression task

MAE loss has been theoretically proven to be robust (Ghosh et al., 2017). Therefore, we first conduct experiments on molecular regression tasks with MAE as the loss function. EGNN (Satorras et al., 2021) is one of the most recent works to address such problems. Beyond the commonly used message-passing process based on the graph structure and features, EGNN further explores the geometric information by considering the atom coordinates $x^{d} = \{x_{0}^{d}, \dots, x_{p - 1}^{d}\}$ . The message update for layer d is defined as follows:

$m_{v u}^{d} = ϕ_{e} (h_{v}^{d}, h_{u}^{d}, \| x_{v}^{d} - x_{u}^{d} \|^{2}, e_{v u}),$ (5)

$x_{v}^{d + 1} = x_{v}^{d} + C \sum_{u \neq v} (x_{v}^{d} - x_{u}^{d}) ϕ_{x} (m_{v u}^{d}),$ (6)

$h_{N (v)}^{d + 1} = \mathrm{AGGREGATE} (\{m_{v u}^{d}, \forall u \in N (v)\}),$ (7)

where $x_{v}^{d}$ and $x_{u}^{d}$ are the coordinates of atom v and its neighbor atom u at d-th step, vu represents the bond between them, $e_{v u}$ denotes the bond features, $ϕ_{e}$ and $ϕ_{x}$ are two output operations, and C equals $1 ∕ (p - 1)$ . Then the update follows the same protocol as Equations (2) and (3).
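To make the roles of $ϕ_{e}$ , $ϕ_{x}$ , and C concrete, the sketch below implements the coordinate update of EGNN (Satorras et al., 2021), in which each atom moves along the difference vectors to all other atoms, weighted by $ϕ_{x}$ of the message and scaled by $C = 1 ∕ (p - 1)$ . The functions `phi_e` and `phi_x` are hypothetical stand-ins for the learned networks, and bond features are omitted for brevity.

```python
import numpy as np

def phi_e(h_v, h_u, dist2):
    """Stand-in message function (a learned network in EGNN); returns a scalar."""
    return np.tanh(h_v @ h_u + dist2)

def phi_x(m):
    """Stand-in coordinate weight function (a learned network in EGNN)."""
    return 0.5 * m

def egnn_coord_update(x, h, phi_e=phi_e, phi_x=phi_x):
    """Move each atom along difference vectors weighted by phi_x(message)."""
    p = x.shape[0]
    C = 1.0 / (p - 1)
    x_new = x.copy()
    for v in range(p):
        for u in range(p):
            if u == v:
                continue
            dist2 = np.sum((x[v] - x[u]) ** 2)      # squared distance, rotation invariant
            m_vu = phi_e(h[v], h[u], dist2)         # message from u to v
            x_new[v] += C * (x[v] - x[u]) * phi_x(m_vu)
    return x_new
```

Because the messages depend on coordinates only through squared distances, rotating or translating the input coordinates rotates or translates the output in the same way, which is the equivariance property the model is named after.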

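MAE is used as the robust loss function to constrain the network training with regard to noisy labels, which is defined as follows:

$ℒ_{M A E} = \frac{1}{M} \sum_{i = 1}^{M} | y_{i} - ŷ_{i} |,$ (8)

where $y_{i}$ and $ŷ_{i}$ are the label and the prediction for the i-th molecule, and M is the size of the dataset.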
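The MAE loss, the mean of absolute differences between labels and predictions over M samples, can be written in a few lines. The comparison with mean squared error below is an illustrative assumption added here, not an experiment from the paper; it shows how a single badly wrong label affects each loss.

```python
import numpy as np

def mae_loss(y, y_hat):
    """Mean absolute error over the M samples."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def mse_loss(y, y_hat):
    """Mean squared error, shown only for comparison."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

y_hat = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.0, 2.0, 3.0, 4.0]
y_noisy = [1.0, 2.0, 3.0, 14.0]     # one label corrupted by +10

mae_noisy = mae_loss(y_noisy, y_hat)   # grows linearly with the label error
mse_noisy = mse_loss(y_noisy, y_hat)   # grows quadratically with the label error
```

In this toy case the corrupted label contributes ten times more to the squared-error loss than to MAE, which illustrates why MAE is less dominated by a few wrong pseudo labels.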

2.3.2. Classification task

Classification tasks are also prevalent among molecular property prediction problems. We therefore conduct experiments on classification tasks to evaluate the effectiveness of the self-training paradigm. However, the commonly used CE loss is not robust, so we employ the generalized cross-entropy (GCE) loss (Zhang and Sabuncu, 2018) to boost the self-training. In particular, GCE loss replaces the CE loss during the student training phase, where the unlabeled data carry pseudo labels predicted by the teacher.

Furthermore, we extend this strategy to an adaptive version, where the GCE loss is calculated and back-propagated solely for the unlabeled data that carry pseudo labels generated by the teacher model. The backbone model utilized for this task is GIN (Xu et al., 2018), which has been theoretically shown to be among the most powerful GNN models. It utilizes an MLP for the state update, and employs a concatenation over all passing steps during the readout phase. The update rule in Equations (2) and (3) can be summarized as follows:

$h_{v}^{d + 1} = \mathrm{MLP}^{d + 1} ((1 + \epsilon) \cdot h_{v}^{d} + \sum_{u \in N (v)} h_{u}^{d}),$ (9)

$h_{G} = \mathrm{CONCAT} (\mathrm{READOUT} (\{h_{v}^{d + 1} \mid v \in V\})),$ (10)

where $\epsilon$ is a fixed scalar or a learnable parameter.

GCE loss is a generalization of CE and MAE. The CE loss is defined by

$ℒ_{C E} = - \sum_{k = 0}^{K - 1} y^{k} \log ŷ^{k},$ (11)

for a K-class classification problem (K = 2 for binary classification), where $y^{k}$ is the one-hot encoding label, and $ŷ^{k}$ denotes the probability output from the prediction network. Letting $f_{k} (x) = ŷ^{k}$ , with k the labeled class, GCE loss is defined as $ℒ_{G C E} = \frac{1 - f_{k} {(x)}^{q}}{q}, \quad \text{where } q \in (0, 1] .$ (12)

GCE loss reduces to CE loss and MAE loss when $q \to 0$ and $q = 1$ , respectively. Detailed proofs can be found in Zhang and Sabuncu (2018).
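The two limiting cases can be checked numerically. The sketch below implements the GCE formula of Equation (12) on the probability assigned to the labeled class and compares it with CE for a very small q and with $1 - p$ for $q = 1$ (the MAE-type loss, up to a constant factor). The function names and the example probabilities are illustrative assumptions.

```python
import numpy as np

def gce_loss(probs, labels, q):
    """Per-sample GCE: (1 - p^q) / q, with p the probability of the labeled class."""
    p = probs[np.arange(len(labels)), labels]
    return (1.0 - p ** q) / q

def ce_loss(probs, labels):
    """Per-sample cross-entropy: -log p of the labeled class."""
    return -np.log(probs[np.arange(len(labels)), labels])

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.25, 0.5, 0.25]])
labels = np.array([0, 2, 1])

near_ce = gce_loss(probs, labels, q=1e-6)   # approaches CE as q -> 0
at_one = gce_loss(probs, labels, q=1.0)     # equals 1 - p at q = 1
```

Intermediate values of q trade off the fast fitting of CE against the noise tolerance of MAE, which is why q is treated as a tunable hyperparameter.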

2.4. PSSP tasks

To further demonstrate the universality and effectiveness of our method, we conduct experiments with two state-of-the-art PSSP methods as the backbone models to evaluate the proposed robust self-training strategy.

AWD-GRU (Moffat and Jones, 2021) is one recent work that utilizes Gated Recurrent Units (GRUs) for PSSP. It stacks three GRU layers, with DropConnect (Wan et al., 2013) applied to the hidden-to-hidden weight matrices of each layer.

EnsembleASP (Guo et al., 2020b) adopts CNN-based networks to capture local context features and bidirectional-LSTM (bidirectional-Long Short-Term Memory) for long-term dependencies. Moreover, the ASP network (a modified version of Atrous Spatial Pyramid Pooling network) (Chen et al., 2018) further improves the prediction performance by identifying the boundary of successive amino acids that have the same secondary structure. In this work, we use a single protein sequence without profile information as the input.

Both backbone models address the three-state secondary structure prediction problem, which forms a three-class classification problem. To keep our experiments consistent, the robust loss exploited here follows the same implementation as in Equations (11) and (12), where K equals 3 and $q \in (0, 1]$ .

3. RESULTS

Extensive experiments are conducted gradually to evaluate the effectiveness of the proposed robust self-training strategy. We first implement self-training on molecular regression tasks with MAE loss as the robust loss function. Then we explore GCE loss on the molecular classification tasks. Finally, we establish self-training with GCE loss on a large-scale protein secondary structure dataset to further demonstrate the superiority of the proposed method. Source code will be released soon after cleaning up.

3.1. Dataset description and setup

3.1.1. Molecular property datasets

QM9 (Ramakrishnan et al., 2014) is a standard benchmark for molecular property regression problems. It is a subset of the GDB-17 database (Ruddigkeit et al., 2012) and contains 134k molecules. It provides 12 quantum chemical properties for each molecule, covering geometric, energetic, electronic, and thermodynamic properties. HIV is introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen. It contains the test results of 41,127 compounds for their ability to inhibit HIV replication. The widely used version provided by MoleculeNet (Wu et al., 2018) contains inactive labels and active labels, which makes it a binary classification task.

For all molecular property prediction tasks, we randomly select 50% of the data as the unlabeled dataset, and the rest is used as the labeled dataset with a 3:1:1 training/validation/test ratio. We do not use an external unlabeled dataset here since most molecules may not express target property at all, which may lead to a biased comparison. Therefore, we further conduct experiments on PSSP since every protein contains the secondary structure labels.
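The data partition described above can be sketched as follows. This is a minimal illustration that assumes a uniformly random permutation for both the unlabeled selection and the 3:1:1 split; the seed, the rounding rule, and the function name `split_indices` are assumptions rather than details from the paper.

```python
import numpy as np

def split_indices(n, seed=0):
    """Hold out 50% as unlabeled, then split the rest 3:1:1 into train/val/test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_unlabeled = n // 2
    unlabeled, labeled = perm[:n_unlabeled], perm[n_unlabeled:]
    n_train = int(round(len(labeled) * 3 / 5))
    n_val = int(round(len(labeled) * 1 / 5))
    train = labeled[:n_train]
    val = labeled[n_train:n_train + n_val]
    test = labeled[n_train + n_val:]
    return unlabeled, train, val, test

unlabeled, train, val, test = split_indices(1000)
```

During self-training, the labels of the `unlabeled` indices would be hidden from the model and replaced by teacher-generated pseudo labels, while `val` and `test` stay fixed across all compared methods.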

3.1.2. Protein secondary structure datasets

For PSSP, three datasets are used in total: labeled dataset, testing dataset, and unlabeled dataset.

Labeled dataset construction contains two steps. First, the protein sequences are generated by the Pisces server (Wang and Dunbrack, 2003) with a maximum consistency of 70% between structures, a maximum resolution of 2.6 Å (Moffat and Jones, 2021), and a length not exceeding 700. The resulting labeled dataset contains 25,503 protein sequences, and is split into training and validation datasets with a 95:5 ratio. Next, the corresponding three-state secondary structure labels are generated by DSSP (Kabsch and Sander, 1983).

The testing dataset is not drawn from the labeled dataset, since the widely used dataset for testing PSSP is CB513 (Cuff and Barton, 1999). Thus, we follow the standard protocol to evaluate the prediction performance on CB513, which contains 513 proteins obtained from Zhou and Troyanskaya (2014).

The unlabeled dataset is generated using Uniclust30 (Mirdita et al., 2017), which consists of UniProtKB (Uniprot Consortium, 2019) sequences clustered to 30% identity, with sequence length less than or equal to 700. We remove those protein sequences that share homology information with the labeled dataset, and then use the remaining 201,408 protein sequences as our unlabeled dataset.

To prevent data leakage, we perform strict screening of the super homologous family information. CATH (Sillitoe et al., 2021) is used for homology assessment, where any sequence overlapped with the testing dataset (CB513) at the superfamily level is removed from both unlabeled dataset and labeled dataset by cross-referencing.

It is noteworthy that for each task, the validation and test datasets are fixed after construction, and remain the same for all the comparison experiments. The unlabeled dataset is only involved in the training phase for the self-training procedure.

3.2. Experimental details

3.2.1. Baselines

For all the experiments, we consider training solely on the labeled datasets as the fundamental baseline, which is denoted as “-labeled.” Then, we establish our vanilla implementation by running experiments with the self-training paradigm on both the labeled dataset and the unlabeled dataset, denoted as “-self-training.” Finally, we integrate robust loss with our vanilla self-training benchmark to demonstrate the superiority of our robust self-training, denoted as “-robust.” For the further enhanced adaptive paradigm with both original loss and robust loss, the experiments are denoted as “-adaptive-robust.” Note that the adaptive option is only established on the classification tasks since MAE loss is solely used in the regression tasks. For molecular property prediction, since the unlabeled dataset is formed by randomly selecting 50% from the original labeled dataset, we also compare the performance when using the original backbone model without self-training on all the data with labels, denoted as “-all.”

Overall, extensive experiments are conducted over three tasks with four backbone models: EGNN for molecular regression task, GIN for molecular classification task, and AWD-GRU and EnsembleASP for PSSP task.

3.2.2. Evaluation metric

We follow the commonly used evaluation metric for each task. Specifically, MAE is used as the evaluation criterion for molecular regression tasks on QM9; area under the receiver operating characteristic curve (ROC-AUC) and Precision-Recall Area Under Curve (PRC-AUC) are used for molecular classification tasks; Q3 accuracy is used for protein three-state secondary structure prediction (Moffat and Jones, 2021).
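As an illustration of the Q3 metric, the sketch below computes the fraction of residues whose predicted three-state label matches the reference. It assumes the accuracy is pooled over all residues in the test set rather than averaged per protein, which the text does not specify; the state letters H, E, and C and the function name are also assumptions.

```python
def q3_accuracy(true_seqs, pred_seqs):
    """Fraction of residues with a correctly predicted three-state label."""
    correct = 0
    total = 0
    for true, pred in zip(true_seqs, pred_seqs):
        if len(true) != len(pred):
            raise ValueError("label strings must have equal length")
        correct += sum(t == p for t, p in zip(true, pred))
        total += len(true)
    return correct / total

# Two toy proteins labeled with helix (H), strand (E), and coil (C).
true_labels = ["HHEC", "CC"]
pred_labels = ["HHCC", "CE"]
score = q3_accuracy(true_labels, pred_labels)   # 4 of 6 residues match
```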

3.2.3. Configurations

We follow the original implementation and settings of the backbone models, and implement robust self-training on top of them. All the hyperparameters of the backbone models remain the same to ensure a fair comparison. For the settings of robust self-training, we perform three iterations for the student training, and tune the hyperparameter q when employing GCE loss. For molecular classification task, we run the experiments three times to alleviate the randomness since HIV dataset is much smaller than other datasets, leading to relatively unstable performance. We take the average and standard deviation of the evaluation scores as the final results.

For molecular regression and PSSP tasks, we follow the original configurations and evaluations to run the experiments one time. The results do not vary much since the training data are sufficiently large and the converged stage is stable.

3.2.4. Training strategy

We follow the same procedure for all three tasks. First, we train a teacher model on the labeled data, and use it to generate pseudo labels for the unlabeled dataset. Next, for the vanilla self-training, we train the student model, which takes the teacher model as the initialization on the combined labeled and pseudo-labeled dataset. Note that the pseudo-labeled dataset is only merged into the training dataset along with the labeled training dataset.

The validation and test datasets remain the same as in the teacher model training. Furthermore, we choose the best student model in the current iteration as the new teacher model to generate a new pseudo-labeled dataset and initialize the student model for the next iteration. We run the student training for three iterations, and take the model with the best validation performance to evaluate on the test dataset. For robust self-training, the procedure is the same as vanilla self-training, except that the robust loss function is employed. Adaptive robust self-training follows the same procedure, with the robust loss used for unlabeled data and the original loss used for labeled data.

3.3. Experimental results

3.3.1. Molecular property prediction

Our first experiment is to employ the self-training paradigm directly on molecular regression tasks, since MAE is theoretically robust to noisy labels. The comparison results for each property are shown in Table 1. “EGNN-labeled” represents the baseline model, which is trained on the labeled datasets. “EGNN-self-training” denotes the experiments that perform self-training by including the unlabeled datasets.

Table 1.
Mean Absolute Error for Each Molecular Property Regression Benchmark on QM9 Dataset

Since the inherently used MAE is theoretically proven to be a robust loss, “EGNN-self-training” is equivalent to our proposed paradigm. “EGNN-all” represents the scenario of using the original backbone model, without self-training, on all the data with labels. As we can observe, the self-training strategy consistently outperforms EGNN-labeled, with a 41.5% average improvement. Moreover, the performance is competitive against the supervised training on the all-labeled dataset. Our implementation achieves the best performance on 9/12 tasks compared with the original EGNN-all trained on all 134k labeled data, with an average MAE improvement of 7.2%. The experiments on regression tasks with MAE indicate that a robust loss function is well suited to the self-training strategy because it copes with the noise in the generated pseudo labels.

We then conduct experiments on the HIV dataset to evaluate how self-training performs on classification tasks. As shown in Figure 2, the improvement of directly implementing self-training is limited, which is reasonable since CE loss is not theoretically robust (Ghosh et al., 2017). Therefore, we explore robust loss function GCE and integrate it with self-training to form the robust self-training paradigm, which further boosts the ROC-AUC to 0.822. The adaptive paradigm, which employs robust loss on unlabeled dataset, further improves the performance to 0.826. Moreover, our methods are competitive with the original GIN implementation on the all-labeled dataset with a 0.002–0.006 improvement. Note that in our self-training experiments, 50% of the dataset is treated as unlabeled, while GIN-all is trained on 100% labeled dataset.

FIG. 2.

ROC-AUC score for molecular property classification benchmark on HIV dataset. Higher score is better. ROC-AUC, area under the receiver operating characteristic curve.

To further evaluate the performance of our proposed robust self-training paradigm, we conduct experiments on more diverse datasets and also include the PRC-AUC score as an evaluation metric. The results are shown in Table 2. “GIN-labeled” represents training on the labeled dataset, which also serves as the teacher model. “Ours” represents the performance of our proposed adaptive robust self-training framework. “GIN-labeled-RL” denotes the teacher model trained with robust loss instead of the original CE loss.

Table 2.

Dataset Information and Comparison Experiments on HIV, Tox21, Toxcast, and MUV Datasets

Dataset	HIV		Tox21		Toxcast		MUV
Data size	41,127		7831		8575		93,087
No. of tasks	1		12		617		17
Metric	ROC	PRC	ROC	PRC	ROC	PRC	ROC	PRC
GIN-labeled	0.786 ± 0.008	0.357 ± 0.036	0.788 ± 0.002	0.319 ± 0.034	0.687 ± 0.008	0.339 ± 0.019	0.686 ± 0.043	0.047 ± 0.034
GIN-labeled-RL	0.782 ± 0.006	0.340 ± 0.009	0.684 ± 0.006	0.153 ± 0.002	0.570 ± 0.018	0.260 ± 0.005	0.581 ± 0.068	0.015 ± 0.008
Ours	0.826 ± 0.006	0.437 ± 0.017	0.813 ± 0.011	0.376 ± 0.026	0.740 ± 0.017	0.391 ± 0.008	0.729 ± 0.036	0.075 ± 0.038

ROC denotes the ROC-AUC score, and PRC is the PRC-AUC score. Higher value indicates better performance. Best scores are marked in bold.

GIN, Graph Isomorphism Network; PRC-AUC, precision-recall area under curve; RL, robust loss; ROC-AUC, area under the receiver operating characteristic curve.

As observed, our models consistently improve upon the teacher model in both ROC-AUC and PRC-AUC scores. The performance of “GIN-labeled-RL” is not as good, which is reasonable since the original dataset is labeled and does not include much noise. Therefore, the benefits of robust loss are not fully realized. In addition, Tox21, Toxcast, and MUV are multitask datasets from MoleculeNet (Wu et al., 2018), containing more than one property label with missing labels, which could introduce additional challenges during training.

We also compare and visualize the ROC-AUC scores from training, validation, and testing between GIN-self-training, GIN-robust, and GIN-adaptive-robust on the HIV dataset, as shown in Figure 3. GIN-self-training represents the vanilla self-training, GIN-robust illustrates self-training with robust loss, and GIN-adaptive-robust denotes the adaptive paradigm. The teacher model is trained for 150 epochs, as is the student model during each iteration.

FIG. 3.

ROC-AUC visualization of training, validation, and testing on HIV dataset. X-axis is the epoch number, and Y-axis is the ROC-AUC value. Dashed lines represent GIN-self-training, darker solid lines indicate GIN-robust, and lighter solid lines denote GIN-adaptive-robust. GIN, Graph Isomorphism Network.

The curves are overlapped for the first 150 epochs since we load the same teacher model to make a fair and explicit comparison of how the robust loss function promotes self-training. From the training curves, we can clearly observe re-training progress between every 150 epochs, and the ROC-AUC values of GIN-robust (blue lines) and GIN-adaptive-robust (orange lines) are much more stable than GIN-self-training (green dashed lines). Meanwhile, as shown in both training and validation curves, GIN-self-training gets easily overfitted compared with our robust self-training methods. Furthermore, the test curves demonstrate that with the help of robust loss, the student model can keep learning better during iteration training.

3.3.2. Protein secondary structure prediction

To mimic the real-world scenario, which means utilizing data that actually does not have a label, we conduct experiments over PSSP. Unlike molecular property prediction, where the external unlabeled dataset may not contain any molecules that express the target property, every protein molecule certainly contains secondary structure labels. Therefore, we use a large-scale external unlabeled dataset to run our robust self-training experiments. As shown in Figure 4, the prediction performance is as expected. PSSP-self-training slightly improves the performance by using self-training, and then along with robust loss, PSSP-robust further boosts the Q3 accuracy to 0.734 and 0.735 for the two backbone models, respectively.

FIG. 4.

Q3 accuracy for the PSSP task. Higher score is better. PSSP, protein secondary structure prediction.

It is noteworthy that PSSP-adaptive-robust does not perform as well as PSSP-robust. The reason is that, for PSSP, the theoretical limit of accuracy has been estimated to be 92% (Dill and MacCallum, 2012; Ho et al., 2021) because the observed secondary structures may themselves be noisy. Consequently, the labeled data used in our self-training framework may already contain noise.

We have also conducted experiments that replace the original CE loss on the labeled PSSP dataset with the robust loss, denoted as PSSP-labeled-robust. Its performance is competitive with that of PSSP-labeled, which indicates that the labeled data might include noisy labels. Therefore, as the comparison results reveal, directly applying the robust loss to the entire dataset is the preferable strategy in such a scenario. These results empirically demonstrate that the proposed robust self-training is a simple yet effective strategy for molecular biology prediction tasks that makes full use of both unlabeled and labeled data.

4. DISCUSSION

Extensive experiments progressively demonstrate the superiority of the proposed robust self-training. First, the regression tasks trained with the MAE loss achieve remarkable improvement. Since MAE is theoretically proven to be robust to noisy labels, this result supports the effectiveness of integrating a robust loss with self-training. Next, we conduct further experiments using self-training with the robust GCE loss to evaluate the performance on classification tasks.

The results confirm that self-training can slightly improve the prediction performance, and that with the help of the robust loss, the training becomes more stable and the performance is further boosted by a large margin. Moreover, although noisy labels in the real world are difficult to recognize, truly labeled data and data containing noisy labels (pseudo-labeled data) can be distinguished in the self-training process. We therefore design the adaptive paradigm, which employs the robust loss conditionally: it keeps the original classification loss for the labeled data and uses the robust loss for the pseudo-labeled data. The results further confirm the effectiveness of the robust loss in handling noisy labels.
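The adaptive paradigm described above can be illustrated with a minimal sketch. It assumes a single-label classification setting in which the model outputs a probability vector per sample; the function names, the batch layout, and the choice q = 0.7 are illustrative assumptions rather than the configuration used in our experiments.

```python
import math

def cross_entropy(probs, label):
    """Standard CE: negative log of the probability assigned to the label."""
    return -math.log(probs[label])

def gce(probs, label, q=0.7):
    """Generalized cross entropy, (1 - p^q) / q with 0 < q <= 1.

    q = 0.7 is an illustrative default, not a value taken from this study.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - probs[label] ** q) / q

def adaptive_robust_loss(batch, q=0.7):
    """Mean loss over a mixed batch of labeled and pseudo-labeled samples.

    Each item is (probs, label, is_pseudo): CE is kept for labeled samples,
    and GCE is applied to pseudo-labeled samples (hypothetical layout).
    """
    total = 0.0
    for probs, label, is_pseudo in batch:
        if is_pseudo:
            total += gce(probs, label, q)
        else:
            total += cross_entropy(probs, label)
    return total / len(batch)

batch = [
    ([0.9, 0.1], 0, False),  # labeled sample, scored with CE
    ([0.2, 0.8], 0, True),   # pseudo-labeled sample the model disagrees with
]
print(adaptive_robust_loss(batch))
```

For the disagreeing pseudo-labeled sample, GCE yields a smaller penalty than CE would, so a possibly wrong pseudo label pulls the student model less strongly.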

Our study represents a preliminary step toward this robust self-training paradigm, within which several unexplored directions remain to be studied. For example, self-training with MAE works well because the default MAE loss for regression tasks is already robust. For classification tasks, however, a robust loss such as GCE is a generalization that interpolates between CE and MAE and can therefore be considered a tradeoff between a robust loss and the traditional classification loss. Other forms of robust loss may be more suitable for different prediction tasks.
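The relationship among GCE, CE, and MAE can be made explicit with the standard form of the GCE loss (Zhang and Sabuncu, 2018). The notation here is an assumption introduced for illustration: f_j(x) denotes the predicted probability of the labeled class j, e_j the corresponding one-hot target, and q the GCE hyperparameter.

```latex
\begin{align}
\mathcal{L}_q\bigl(f(x), e_j\bigr) &= \frac{1 - f_j(x)^{q}}{q}, \qquad q \in (0, 1], \\
\lim_{q \to 0} \mathcal{L}_q\bigl(f(x), e_j\bigr) &= -\log f_j(x) \quad \text{(CE)}, \\
\mathcal{L}_{1}\bigl(f(x), e_j\bigr) &= 1 - f_j(x) = \tfrac{1}{2}\,\lVert e_j - f(x) \rVert_1 \quad \text{(MAE up to a constant factor)}.
\end{align}
```

Smaller values of q therefore behave more like CE, whereas values closer to 1 behave more like MAE and are more tolerant of noisy labels.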

Moreover, the Alternative Convex Search (ACS) algorithm is introduced to optimize a truncated version of the GCE loss by balancing the weights and model parameters. Given that molecular prediction tasks often involve complex, multidimensional data with unique challenges such as label imbalance, multilabel outputs, and pseudo-labeling, such optimization may not yield optimal results in naive settings. Future work can explore more rigorous algorithms that are adapted to the specific demands of molecular prediction tasks.

5. CONCLUSIONS

In this study, we propose a robust self-training paradigm for molecular prediction tasks by exploring robust loss functions to constrain the self-training process. Extensive experiments on molecular regression, molecular classification, and PSSP tasks have demonstrated that self-training accompanied by a robust loss can boost prediction performance by taking advantage of both labeled and unlabeled data. Moreover, our proposed robust self-training is model- and task-agnostic, so it can be easily integrated into any molecular biology prediction task and can benefit the broader computational molecular biology community.

Footnotes

ACKNOWLEDGMENT

This study is a major extension of our previous conference version (Ma et al., ), which was published as part of the ACM-BCB conference proceedings.

AUTHORS' CONTRIBUTIONS

H.M.: Conceptualization and methodology; F.J.: Formal analysis; Y.R.: Resources; Y.G.: Writing—review and editing; and J.H.: Supervision.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work was partially supported by the Cancer Prevention and Research Institute of Texas (CPRIT) award (RP230363).

References

1. Babakhin, Sanakoyeu, Kitamura. Semi-Supervised Segmentation of Salt Bodies in Seismic Images Using an Ensemble of Convolutional Neural Networks. In: German Conference on Pattern Recognition. Springer, 2019; pp. 218–231.

2. Chen L-C, Zhu, Papandreou, et al. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018; pp. 801–818.

3. Cheng. Semi-Supervised Learning for Neural Machine Translation. In: Joint Training for Neural Machine Translation. Springer, 2019; pp. 25–40.

4. Ching, Himmelstein, Beaulieu-Jones, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface, 2018; 15(141):20170387.

5. Cho, Van Merriënboer, Gulcehre, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

6. Cuff, Barton. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 1999; 34(4):508–519.

7. Dill, MacCallum. The protein-folding problem, 50 years on. Science, 2012; 338(6110):1042–1046.

8. Duvenaud, Maclaurin, Aguilera-Iparraguirre, et al. Convolutional networks on graphs for learning molecular fingerprints. arXiv preprint arXiv:1509.09292, 2015.

9. Englesson, Azizpour. Generalized Jensen-Shannon divergence loss for learning with noisy labels. arXiv preprint arXiv:2105.04522, 2021.

10. Ghosh, Kumar, Sastry. Robust Loss Functions Under Label Noise for Deep Neural Networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, Volume 31. 2017.

11. Gilmer, Schoenholz, Riley, et al. Neural Message Passing for Quantum Chemistry. In: International Conference on Machine Learning. PMLR, 2017.

12. Guo, Wu, Ma, et al. Bagging MSA Learning: Enhancing Low-Quality PSSM with Deep Learning for Accurate Protein Structure Property Prediction. In: International Conference on Research in Computational Molecular Biology. Springer, 2020a; pp. 88–103.

13. Guo, Wu, Ma, et al. Protein Ensemble Learning with Atrous Spatial Pyramid Networks for Secondary Structure Prediction. In: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2020b; pp. 17–22.

14. Guo, Wu, Ma, et al. Comprehensive study on enhancing low-quality position-specific scoring matrix with deep learning for accurate protein structure property prediction: Using bagging multiple sequence alignment learning. J Comput Biol, 2021; 28(4):346–361.

15. Guo, Wu, Ma, et al. Self-Supervised Pre-Training for Protein Embeddings Using Tertiary Structures. In: Proceedings of the AAAI Conference on Artificial Intelligence, Volume 36. 2022; pp. 6801–6809.

16. He, Gu, Shen, et al. Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788, 2019.

17. Heffernan, Paliwal, Lyons, et al. Improving prediction of secondary structure, local backbone angles and solvent accessible surface area of proteins by iterative deep learning. Sci Rep, 2015; 5(1):1–11.

18. Ho C-T, Huang Y-W, Chen T-R, et al. Discovering the ultimate limits of protein secondary structure prediction. Biomolecules, 2021; 11(11):1627.

19. Hu, Liu, Gomes, et al. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.

20. Kabsch, Sander. DSSP: Definition of secondary structure of proteins given a set of 3D coordinates. Biopolymers, 1983; 22:2577–2637.

21. Li, Sun, Liu, et al. Learning to self-train for semi-supervised few-shot classification. Adv Neural Inform Process Syst, 2019; 32:10276–10286.

22. Ma, An, Wang, et al. Deep graph learning with property augmentation for predicting drug-induced liver injury. Chem Res Toxicol, 2020a; 34(2):495–506.

23. Ma, Bian, Rong, et al. Cross-dependent graph neural networks for molecular property prediction. Bioinformatics, 2022a; 38(7):2003–2009.

24. Ma, Jiang, Rong, et al. Robust Self-Training Strategy for Various Molecular Biology Prediction Tasks. In: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 2022b; pp. 1–5.

25. Ma, Rong, Liu, et al. Gradient-Norm Based Attentive Loss for Molecular Property Prediction. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2021; pp. 497–502.

26. Ma, Huang, Wang, et al. Normalized Loss Functions for Deep Learning with Noisy Labels. In: International Conference on Machine Learning. PMLR, 2020b; pp. 6543–6553.

27. Mann, McCallum. Simple, Robust, Scalable Semi-Supervised Learning via Expectation Regularization. In: Proceedings of the 24th International Conference on Machine Learning. 2007; pp. 593–600.

28. Mirdita, von den Driesch, Galiez, et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res, 2017; 45(D1):D170–D176.

29. Moffat, Jones. Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework. Bioinformatics, 2021; 37(21):3744–3751.

30. Paul, Mytelka, Dunwiddie, et al. How to improve R&D productivity: The pharmaceutical industry's grand challenge. Nat Rev Drug Discov, 2010; 9(3):203.

31. Pham, Dai, Xie, et al. Meta Pseudo Labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021; pp. 11557–11568.

32. Ramakrishnan, Dral, Rupp, et al. Quantum chemistry structures and properties of 134 kilo molecules. Sci Data, 2014; 1(1):1–7.

33. Rong, Bian, Xu, et al. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inform Process Syst, 2020; 33:12559–12571.

34. Ruddigkeit, Van Deursen, Blum, et al. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inform Model, 2012; 52(11):2864–2875.

35. Satorras, Hoogeboom, Welling. E(n) Equivariant Graph Neural Networks. In: International Conference on Machine Learning. PMLR, 2021; pp. 9323–9332.

36. Sillitoe, Bordin, Dawson, et al. CATH: Increased structural coverage of functional space. Nucleic Acids Res, 2021; 49(D1):D266–D273.

37. Sønderby, Winther. Protein secondary structure prediction with long short term memory networks. arXiv preprint arXiv:1412.7828, 2014.

38. Touvron, Cord, Douze, et al. Training Data-Efficient Image Transformers & Distillation Through Attention. In: International Conference on Machine Learning. PMLR, 2021; pp. 10347–10357.

39. Uniprot Consortium. Uniprot: A worldwide hub of protein knowledge. Nucleic Acids Res, 2019; 47(D1):D506–D515.

40. Wan, Zeiler, Zhang, et al. Regularization of Neural Networks Using Dropconnect. In: International Conference on Machine Learning. PMLR, 2013; pp. 1058–1066.

41. Wang, Dunbrack Jr. Pisces: A protein sequence culling server. Bioinformatics, 2003; 19(12):1589–1591.

42. Wang, Guo, Wang, et al. SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 2019a; pp. 429–436.

43. Wang, Peng, Ma, et al. Protein secondary structure prediction using deep convolutional neural fields. Sci Rep, 2016; 6(1):1–11.

44. Wang, Ma, Chen, et al. Symmetric Cross Entropy for Robust Learning with Noisy Labels. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019b; pp. 322–330.

45. Wu, Ramsundar, Feinberg, et al. MoleculeNet: A benchmark for molecular machine learning. Chem Sci, 2018; 9(2):513–530.

46. Xie, Luong M-T, Hovy, et al. Self-Training with Noisy Student Improves ImageNet Classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020; pp. 10687–10698.

47. Xu, Hu, Leskovec, et al. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

48. Yang, Swanson, Jin, et al. Analyzing learned molecular representations for property prediction. J Chem Inform Model, 2019; 59(8):3370–3388.

49. Zhang, Wang, Zhu, et al. Seq3seq Fingerprint: Towards End-to-End Semi-Supervised Deep Drug Discovery. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2018; pp. 404–413.

50. Zhang, Sabuncu. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In: 32nd Conference on Neural Information Processing Systems (NeurIPS). 2018.

51. Zhao, Han, Shyu C-R, et al. Determining effects of non-synonymous SNPs on protein-protein interactions using supervised and semi-supervised learning. PLoS Comput Biol, 2014; 10(5):e1003592.

52. Zhou, Troyanskaya. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. arXiv preprint arXiv:1403.1347, 2014.

53. Zoph, Ghiasi, Lin T-Y, et al. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.

