LightMobileBert: A secondary lightweight model based on MobileBert

Abstract

MobileBert is a general-purpose lightweight model that still suffers from a large network depth and parameter cardinality. This paper therefore proposes a secondary lightweight model entitled LightMobileBert, which retains the bottom 12 Transformer layers of the pre-trained MobileBert and applies tensor decomposition to the retained structure, so that pre-training can be skipped and the parameters are reduced further. In addition, a joint loss function is constructed from an improved Supervised Contrastive Learning loss function and the Cross-Entropy loss function to improve performance and stability. Finally, the LMBert_Adam optimizer, an improved Bert_Adam optimizer, is used to optimize the model. The experimental results show that LightMobileBert achieves a higher average performance than MobileBert and other popular models while requiring only 57% of the network parameters of MobileBert, confirming that LightMobileBert retains high performance while being lightweight.

Keywords

Natural language processing, lightweight model, tensor decomposition, supervised contrastive learning

1 Introduction

Large-scale pre-trained language models (PTMs) can capture the general language knowledge from some large language corpora and have become the backbone model for many natural language processing (NLP) tasks. Especially BERT [1] and its variants [2, 3] have been proven very effective. However, these models typically suffer from an enormous parameter cardinality, slow response speed, and difficulty in deploying on hardware-constrained edge devices. Therefore, lightweight NLP models have emerged.

Currently, popular lightweight methods for PTMs include model pruning [4–7], model quantization [8–11], low-rank decomposition, parameter sharing [12–14], and knowledge distillation [15–19]. Model pruning removes parts that have an insignificant impact on the model’s performance. Model quantization reduces the numerical precision of the model’s parameters. Low-rank decomposition decomposes a large tensor into several smaller tensors, and parameter sharing is usually involved in this process. Finally, knowledge distillation transfers knowledge from a teacher model to a student model. All of these methods aim to obtain a small number of network parameters, a short inference time, and a relatively appealing performance.

Although these methods have achieved noticeable results by compressing the original models, comparatively few methods build secondary lightweight models on top of existing lightweight models. Considering that MobileBert [19] is more general than other lightweight models (described in Section 2.4), we develop a secondary lightweight model based on MobileBert that attains an appealing performance while being considerably lighter. The first step halves the Transformer stack of MobileBert and retains the bottom layers as the pruned MobileBert. The second step applies tensor decomposition to resolve the mismatch between the pre-trained MobileBert network and the LightMobileBert model. Specifically, step two is divided into two phases. In the first, the pruned MobileBert loads the pre-trained MobileBert weights, and in the second, the CANDECOMP/PARAFAC (CP) tensor decomposition scheme is used to process the structure of the pruned MobileBert. These operations optimize the model’s structure, avoid training the model from scratch, and save pre-training resources. In the third step, the Supervised Contrastive Learning [20–22] (SCL) loss function is migrated from the image processing domain to NLP, and its improved form is combined with the Cross-Entropy (CE) loss function to construct a joint loss function for training the LightMobileBert model. As a result, samples of the same category are drawn closer together and samples of different categories are pushed further apart, so the model has better performance and stability. Finally, the LMBert_Adam optimizer based on Bert_Adam [23] is proposed to improve and stabilize our model’s performance. Extensive experiments on the seven classification corpora of GLUE [24] verify that although the proposed LightMobileBert has fewer parameters than other models, it maintains a relatively high performance.

This paper conducts some work in secondary lightweight models, which applies deep learning models to edge devices. Overall, the contributions of this paper are as follows:

We propose a secondary lightweight model entitled LightMobileBert that is based on MobileBert. LightMobileBert is one of the rare secondary lightweight models in the field of NLP, with its success representing a further reduction in the threshold of deep learning deployment on resource-constrained edge devices and secondary innovation about lightweight models.

Specifically, our proposed method has the following innovations. First, the size of the MobileBert model is reduced directly by reducing the Transformers layer. Second, the tensor decomposition technique used to process the model realizes the change in the internal structure without requiring a pre-training process. Third, the SCL loss function is extended and improved from the image process domain and combined with the CE loss function to enhance our model’s performance and stability. Finally, the LMBert_Adam optimizer based on Bert_Adam is proposed to improve and stabilize the model’s performance on different corpora.

LightMobileBert has only 14.47M network parameters, which is 57% of the number in MobileBert. In the experiments involving seven corpora of GLUE, the average performance is improved by 2.06% compared to MobileBert. At the same time, our model attains higher performance and is more lightweight than other lightweight models, which demonstrates the effectiveness of the LightMobileBert model. The code is available at https://github.com/DeguangChen/LightMobileBert.

2 Related work

This section reviews the most related works in NLP Lightweight modeling. Primarily, this paper is based on MobileBert conducting secondary lightweight research (MobileBert is a general and lightweight model based on BERT). Therefore, this section makes the necessary introduction to the representative lightweight models and the applicable technologies.

2.1 Model pruning

As a naive lightweight method, pruning has a wide range of applications in NLP. In general, model pruning is the tailoring of some parts that have a limited impact on model performance.

Compressing BERT [4] is dedicated to exploring the influence of weight pruning in BERT’s pre-training stage on the performance of subsequent tasks. The model discusses three different level pruning strategies and draws the corresponding conclusions. Precisely, the Reweighted Proximal Pruning [5] (RPP) for BERT has been proposed, with the experimental results revealing that proximal pruning maintains high performance for both the pre-trained tasks and subsequent multiple fine-tuned tasks. At the same time, the model can be deployed on some edge devices. The LayerDrop model [6], a form of a structured dropout, has a regularization effect during training and permits pruning at inference time. Experimental results reveal that it is possible to select subnetworks of some depth from a large network without fine-tuning them while posing a limited impact on performance. The BERT-OF-THESEUS model [7] adopts the Theseus concept to perform an inter-layer replacement of Transformers, decreasing the parameters while avoiding pre-training and substantially saving computational power. However, our model and existing methods have verified that similar effects are possible by directly exploiting the bottom layers of the Transformers.

Model pruning is a simple but practical method that can reduce the model size and accelerate model convergence without significantly affecting the model’s performance. At present, this kind of method is relatively mature and has many applications in the NLP domain.

2.2 Model quantization

As an essential part of lightweight NLP, quantization is commonly an effective method. As a general rule, model quantization aims to handle the model’s parameters’ precision properly.

For instance, the Q8Bert model [8] quantizes the General Matrix Multiply (GEMM) operations in BERT’s fully connected and embedding layers. Simultaneously, quantization-aware training is executed in subsequent tasks, so the model size is one-fourth of that of BERT while the performance penalty is minimized. Similar to this method is Q-Bert [9]. Moreover, the TernaryBert model [10] splits the quantization of the BERT model into weight quantization and activation quantization. In this work, to ternarize the BERT weights, the authors use both the approximation-based ternary weight networks [25] (TWN) and the loss-aware ternarization [26] (LAT). The method thereby achieves noticeable compression results. FQ-BERT [11] fully quantizes the BERT model, covering the weights, activations, softmax, layer normalization, and all the intermediate results. Experiments have demonstrated that the FQ-BERT model achieves noticeable compression of the weights with negligible performance loss.

Model pruning compresses a model by reducing its number of parameters, whereas model quantization compresses it by storing the parameters at low precision instead of high precision; in essence, the model’s parameter cardinality is not reduced. Currently, mixed-precision methods are popular for lightweight models, and they are essentially a form of model quantization.

2.3 Low-rank decomposition, parameter sharing

Because matrix decomposition requires considerable mathematical background, low-rank decomposition is seldom used in lightweight NLP. However, almost all low-rank decomposition models in NLP are widely known. In short, decomposing a large tensor into several small tensors is called low-rank decomposition, and parameter sharing is usually involved in this process.

Among the successful models, Albert [12] is a representative model that combines low-rank decomposition and parameter sharing. The model adopts word vector decomposition, sentence-order prediction, and several good-quality training corpora, considerably reducing the number of model parameters while improving its performance relative to BERT. At the same time, parameter sharing significantly reduces the number of model parameters. However, this method does not reduce the computational complexity. The Y-Tuning model [13] adapts frozen large-scale PTMs to specific downstream tasks. Because it tunes neither the features of the input text nor the model parameters, the model is both parameter-efficient and training-efficient. Moreover, the YOCO-BERT model [14] constructs an enormous search space that covers almost all configurations in the BERT model. A novel stochastic natural gradient optimization method then guides the generation of optimal candidate architectures, balancing exploration and exploitation.

Low-rank decomposition is supported by solid mathematical theory, which pruning and quantization relatively lack. However, finding a suitable mathematical formulation is challenging, which may be one reason why low-rank decomposition is less used.

2.4 Knowledge distillation

Distillation is the most widely used method in model compression, which is the transfer of knowledge from the teacher model to the student model.

The DistilBert model [15] compressed BERT’s 12 Transformer layers to six, sacrificing 3% performance in exchange for 40% parameter compression. However, compared with the current popular distillation models, its parameter cardinality is comparatively large. TinyBert [16] adopted a two-stage training method and calculated the loss between the teacher and the student models at multiple intermediate stages, aligning them as closely as possible to facilitate the knowledge transfer from the teacher model to the student model. At the same time, the corpora were considerably augmented, and thus the model made more significant progress in both size and performance. However, TinyBert has many hyper-parameters, which increases its tuning complexity. Moreover, comparing it with models such as BERT is not fair once corpus augmentation has been applied. Similar models are the Simplified TinyBert [17] and the CatBert [18].

The MobileBert model [19] is relatively standard in the current distillation field. This model uses the same depth as Bert_large (24 layers) and is a thin version of Bert_large equipped with bottleneck structures and a designed balance between Multi-Head Attention (MHA) and Feed-Forward Networks (FFN). A specially designed teacher model, namely an inverted-bottleneck incorporated Bert_large model (IB-Bert), is trained first. Then the knowledge is transferred from IB-Bert to MobileBert. The most prominent feature of the IB-Bert model is that a linear mechanism, namely the inverted bottleneck, is added to the basic Transformer to increase the dimension of the corresponding network structure (because a deep and narrow model is difficult to train). In this way, the IB-Bert model obtains a high performance. To achieve a lightweight structure, MobileBert uses a linear mechanism, the bottleneck, to reduce the corresponding network’s dimension. A problem introduced by this bottleneck structure is that the balance between the MHA and the FFN modules is broken. To solve this problem, stacked feed-forward networks are used in MobileBert to rebalance the relative size of MHA and FFN. Each Transformer layer of MobileBert contains one MHA but several stacked FFNs. Thus, the teacher model can smoothly transfer knowledge to the student model, and the model is successfully compressed.

In the conventional distillation compression process, the student models are always trained based on the mature teacher model, limiting the student model’s performance to a certain extent. MobileBert puts knowledge distillation first and then designs the teacher and student models. Therefore, the student model of this method is more versatile and has a higher performance. For the above reasons, this paper chooses the MobileBert as the basic model for innovation.

3 LightMobileBert model

The LightMobileBert model comprises Embedding, Transformers, joint loss function optimization, and classification layers. Among them, the Transformer layer comprises linear mapping, multi-head attention, several feed-forward neural networks, and multiple regularization layers. The LightMobileBert model structure is illustrated in Fig. 1. Since it inherits the structure of MobileBert, which in turn inherits the structure of BERT, the overall structure of LightMobileBert does not need to be introduced in detail. We only introduce the improved and optimized parts of the LightMobileBert model, namely the layer reduction, tensor decomposition, joint loss function, and optimizer, which are highlighted in red in Fig. 1.

Fig. 1

LightMobileBert model structure.

3.1 Layer reduction operation on transformers

The main innovation of the BERT-OF-THESEUS model is that every two or three consecutive Transformer layers are randomly reserved, and then the downstream tasks are fitted. This method is relatively complex, and obtaining inconsistent corpora feature information results in a particular performance penalty. Nonetheless, this method brings some inspiration to our model.

Considering the pruning strategy presented in Section 2.1 and the main innovation of the BERT-OF-THESEUS model, we construct the basic LightMobileBert model as follows. It is well known that, in both image processing and NLP, the bottom layers of a model capture more general features, while the higher layers capture more specific features. Based on this, LightMobileBert takes the bottom 12 Transformer layers of MobileBert as basic model one. In addition, following the idea of the BERT-OF-THESEUS model, one layer out of every two consecutive Transformer layers of MobileBert is randomly retained to form basic model two. The experiments reveal that the performance of the two basic models is roughly the same. Therefore, following the principle of simplicity, this paper selects basic model one as the basic LightMobileBert.
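The two candidate layer-selection strategies can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the layer names are hypothetical stand-ins for MobileBert's 24 Transformer layers, "bottom" is assumed to mean the layers closest to the embedding, and the random strategy is assumed to keep exactly one layer from each consecutive pair.

```python
import random


def keep_bottom(layers, k):
    """Basic model one: keep the k layers closest to the embedding."""
    return layers[:k]


def keep_one_per_pair(layers, rng):
    """Basic model two: randomly keep one layer from each consecutive pair."""
    kept = []
    for start in range(0, len(layers) - 1, 2):
        kept.append(layers[start + rng.randrange(2)])
    return kept


# Hypothetical stand-ins for the 24 Transformer layers of MobileBert.
layers = [f"layer_{i}" for i in range(24)]

basic_model_one = keep_bottom(layers, 12)
basic_model_two = keep_one_per_pair(layers, random.Random(0))

print(basic_model_one)
print(basic_model_two)
```

Both strategies halve the depth to 12 layers; the first is deterministic and keeps a contiguous block, which is one reason it is the simpler choice.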

Through the above operations, the basic structure of LightMobileBert is constructed. However, this basic structure has many parameters and poor performance compared to other well-performing models. Therefore, it is necessary to optimize the basic LightMobileBert further.

3.2 Tensor decomposition

Since models such as BERT and MobileBert are two-stage models and require very high-performance hardware equipment during pre-training, it is difficult for general research institutions to pay the high pre-trained costs. Therefore, for research utilizing BERT and its related models, most researchers do not alter the structure of the pre-trained parts and only deal with the subsequent tasks to improve the model’s performance.

LightMobileBert is a secondary lightweight model based on MobileBert. To effectively use the existing pre-trained parameters of MobileBert in the basic LightMobileBert, after the operations presented in Section 3.1, we use the CANDECOMP/PARAFAC (CP) tensor decomposition scheme to process the network’s structure. This strategy optimizes the model’s structure, avoids training the model from scratch, and largely saves pre-training resources.

CP decomposition is a special case of Tucker decomposition. The main difference is that Tucker decomposition produces a core tensor, whereas CP decomposition does not. CP decomposition expresses a large tensor as a sum of rank-one tensors. For a tensor χ of size n₁×n₂×n₃, as illustrated in Fig. 2, the CP decomposition formula is: $χ \approx \sum_{r = 1}^{R} A (:, r) \otimes B (:, r) \otimes C (:, r)$ (1) where A, B, and C are the factor matrices, of sizes n₁×R, n₂×R, and n₃×R, respectively, and the symbol ⊗ represents the tensor (outer) product.

Fig. 2

CP decomposition form.
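To make formula (1) concrete, the sketch below rebuilds a small third-order tensor from three factor matrices and compares the number of stored values with and without the decomposition. The factor values and the helper names (`cp_reconstruct`, `dense_params`, `cp_params`) are illustrative assumptions; the sketch shows only the reconstruction direction, not how the factors are fitted.

```python
def cp_reconstruct(A, B, C):
    """Rebuild an n1 x n2 x n3 tensor as a sum of R rank-one tensors.

    A, B, C are factor matrices given as lists of rows, with shapes
    n1 x R, n2 x R and n3 x R.
    """
    n1, n2, n3 = len(A), len(B), len(C)
    R = len(A[0])
    return [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(R))
              for k in range(n3)]
             for j in range(n2)]
            for i in range(n1)]


def dense_params(n1, n2, n3):
    """Number of entries in the full tensor."""
    return n1 * n2 * n3


def cp_params(n1, n2, n3, R):
    """Number of entries in the three factor matrices."""
    return R * (n1 + n2 + n3)


# Toy rank-2 factors (illustrative values only).
A = [[1.0, 0.0], [0.0, 1.0]]                # 2 x 2
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3 x 2
C = [[1.0, 1.0], [2.0, 0.5]]                # 2 x 2

X = cp_reconstruct(A, B, C)                 # 2 x 3 x 2 tensor
print(X)
print(dense_params(64, 64, 64), cp_params(64, 64, 64, 8))
```

The saving grows quickly with the tensor size: the dense tensor stores n₁·n₂·n₃ values, while the factors store only R·(n₁+n₂+n₃).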

The Self-Attention mechanism is the most pivotal part of the Transformer in the basic LightMobileBert model, and its parameters account for a large proportion of the model. Hence, compressing this key part best reflects the effectiveness of CP decomposition, and CP decomposition is therefore used to process the Self-Attention in the basic LightMobileBert. All tensors with a size of [–1,128] are processed as [–1,64]. Compared with MobileBert, after the operations presented in Sections 3.1 and 3.2, the model becomes more lightweight while saving the pre-training costs; this model is denoted LightMobileBert_basic. However, its performance is reduced further because it is lighter and is not trained from scratch. Therefore, the model’s performance must be improved.

3.3 Joint loss function

Same-source images are multiple images formed from one image by different types of transformation, such as rotation, clipping, and affine transformation. Same-category images are a group of images with the same label. Self-supervised Contrastive Learning [20] only distinguishes same-source images and cannot distinguish same-category images. Therefore, the Supervised Contrastive Learning [21, 22] (SCL) loss function has been proposed in image processing. The SCL method incorporates labels to bring the image features of the same category close to each other and to push the image features of different categories as far apart as possible, achieving better classification results. The SCL formula is: $\begin{matrix} ς^{scl} = \sum_{i = 1}^{2N} ς_{i}^{scl} \\ ς_{i}^{scl} = \frac{-1}{2N_{\tilde{y}_{i}} - 1} \sum_{j = 1}^{2N} 1_{i \neq j} \cdot 1_{\tilde{y}_{i} = \tilde{y}_{j}} \cdot \log \frac{\exp (z_{i} \cdot z_{j} / τ)}{\sum_{k = 1}^{2N} 1_{i \neq k} \cdot \exp (z_{i} \cdot z_{k} / τ)} \end{matrix}$ (2) where N represents the number of pictures contained in each MiniBatch, 2N is the result of performing data augmentation processing on the N pictures twice, ${\tilde{y}}_{i}$ and ${\tilde{y}}_{j}$ are the label values, $N_{\tilde{y}_{i}}$ is the number of pictures in the MiniBatch whose label is ${\tilde{y}}_{i}$, z_i, z_j, z_k are the extracted feature values, and τ is a constant greater than zero.

Formula (2) reveals that SCL requires an augmented dataset. Hence, we used Glove [27] and Easy Data Augmentation technologies [28] to augment the corpora for the experiments but found that the performance of our model was not significantly improved. In addition, it is unfair to compare the model trained with augmented corpora with the models without augmentation technologies. So, in this paper, all original corpora are kept unchanged, and we change formula (2) to make it applicable for the NLP task. The improved SCL loss function is given by:

$ς_{SCL} = - \sum_{i = 1}^{N} \frac{1}{N_{\tilde{y}_{i}} - 1} \sum_{j = 1}^{N} 1_{i \neq j} \cdot 1_{\tilde{y}_{i} = \tilde{y}_{j}} \cdot \log \frac{\exp (z_{i} \cdot z_{j} / τ)}{α \sum_{k = 1}^{N} 1_{i \neq k} \cdot \exp (z_{i} \cdot z_{k} / τ)}$ (3) where N denotes the batch size, ${\tilde{y}}_{i}$ and ${\tilde{y}}_{j}$ are the labels, $N_{\tilde{y}_{i}}$ is the number of samples in the batch whose label is ${\tilde{y}}_{i}$, and z_i, z_j, z_k are the extracted feature values. Since no augmentation processing is performed, a selected sample i may have no other sample with the same label in the batch. In addition, due to the restriction of the category labels ( $1_{\tilde{y}_{i} = \tilde{y}_{j}}$ ), the case in which exp(z_i · z_j/τ) is equal to $\sum_{k = 1}^{N} 1_{i \neq k} \cdot \exp (z_{i} \cdot z_{k} / τ)$ can occur, i.e., the logarithm evaluates to log(1). To prevent this, the denominator is multiplied by the adjustment coefficient α (α>1).
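A minimal pure-Python sketch of formula (3) is given below, intended only to make the indexing and the role of α explicit. Several choices are assumptions rather than part of the paper: the features are taken to be L2-normalized vectors, the default values of `tau` and `alpha` are arbitrary, and an anchor with no same-label sample in the batch is simply skipped (the formula would otherwise divide by zero).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def improved_scl_loss(features, labels, tau=0.5, alpha=2.0):
    """Improved SCL loss of formula (3) on one batch (no augmentation)."""
    n = len(features)
    total = 0.0
    for i in range(n):
        # Samples j != i that share the label of the anchor i.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # assumption: anchors without positives contribute zero
        # Denominator: alpha times the sum over all k != i.
        denom = alpha * sum(math.exp(dot(features[i], features[k]) / tau)
                            for k in range(n) if k != i)
        log_terms = sum(
            math.log(math.exp(dot(features[i], features[j]) / tau) / denom)
            for j in positives)
        # len(positives) equals N_{y_i} - 1 in formula (3).
        total -= log_terms / len(positives)
    return total


# Two tight clusters with matching labels versus mismatched labels.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_matched = improved_scl_loss(feats, [0, 0, 1, 1])
loss_mismatched = improved_scl_loss(feats, [0, 1, 0, 1])

# Degenerate batch: two samples with the same label.
pair = [[1.0, 0.0], [1.0, 0.0]]
loss_pair_alpha1 = improved_scl_loss(pair, [0, 0], alpha=1.0)
loss_pair_alpha2 = improved_scl_loss(pair, [0, 0], alpha=2.0)
```

In the degenerate two-sample batch the numerator equals the unscaled denominator, so with α = 1 every term is log(1) = 0 and the loss carries no signal; with α > 1 each term becomes log(1/α), which is the situation the adjustment coefficient is meant to handle.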

The CE loss function (ς_CE) is commonly used in neural network classification models because of its good performance. It is formulated as: $ς_{CE} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{c = 1}^{C} y_{i, c} \cdot \log p_{i, c}$ (4) where N represents the batch size, C is the number of classes, y denotes the true labels, and p represents the predicted probabilities. In this paper, the CE loss function (ς_CE) and the improved SCL loss function (ς_SCL) are added to obtain the joint loss function (loss): $loss = ς_{CE} + ς_{SCL}$ (5)

Compared with the CE loss function alone, the joint loss function yields features that are more aggregated within a class and more dispersed between classes. The comparison of their classification effects is illustrated in Fig. 3.
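Formulas (4) and (5) can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the labels are assumed to be one-hot rows, and the SCL value passed to `joint_loss` is a placeholder number rather than a computed result.

```python
import math


def cross_entropy(y, p):
    """Formula (4): mean cross-entropy over a batch.

    y[i][c] is the one-hot true label and p[i][c] the predicted probability
    of class c for sample i. Terms with y[i][c] == 0 are skipped.
    """
    n = len(y)
    total = 0.0
    for yi, pi in zip(y, p):
        for yc, pc in zip(yi, pi):
            if yc > 0:
                total += yc * math.log(pc)
    return -total / n


def joint_loss(ce_value, scl_value):
    """Formula (5): unweighted sum of the CE and improved SCL losses."""
    return ce_value + scl_value


y = [[1, 0], [0, 1]]
p = [[0.5, 0.5], [0.25, 0.75]]
ce = cross_entropy(y, p)

# 0.3 is a placeholder SCL value, used only to show the summation.
loss = joint_loss(ce, 0.3)
```

Note that formula (5) adds the two terms with equal weight, so their relative scale is governed only by τ and α inside the SCL term.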

Fig. 3

The classification effects comparison.

3.4 LMBert_Adam optimizer

The optimizer has a significant impact on the model’s performance. Tianyi et al. [29] argued that one of the main causes of the BERT model’s instability is that the Bert_Adam optimizer omits the bias correction term of the standard Adam optimizer. When the corpus is small, the model’s performance suffers some loss and fluctuates strongly.

Based on the above findings, this paper proposes the LMBert_Adam optimizer and emphasizes its innovations, i.e., the aspects that differ from Bert_Adam. Specifically, we first restore the optimizer to the standard Adam optimizer by reinstating the bias correction steps that Bert_Adam omits (lines 9 and 10 in Table 1). This way, the model’s performance is stabilized and improved on the small corpora. Furthermore, an additional attenuation coefficient γ is added to the standard Adam optimizer to act on the current parameters θ (the γ term in line 11). Adding γ gives the optimizer an extra hyper-parameter to fine-tune, which helps it obtain better performance to a certain extent. The LMBert_Adam optimization algorithm is presented in Table 1.

Table 1
LMBert_Adam optimization algorithm

Algorithm 1: Adam pseudocode adapted from Kingma and Ba [23]. $g_{t}^{2}$ denotes the elementwise square $g_{t} \odot g_{t}$. β₁ and β₂ to the power t are denoted as $β_{1}^{t}$ and $β_{2}^{t}$. All operations on vectors are element-wise. The suggested hyperparameter values are α=0.001, β₁=0.9, β₂=0.999, ɛ=10^-8 (according to Kingma and Ba) and γ=0.02. BERTADAM omits the bias correction (lines 9-10) and treats $m_{t}$ and $v_{t}$ as $\hat{m}_{t}$ and $\hat{v}_{t}$ in line 11. ADAM omits the $γ \cdot θ_{t-1}$ term in line 11.
Require: α: learning rate; β₁, β₂ ∈ [0, 1): exponential decay rates for the moment estimates; f(θ): stochastic objective function with parameters θ; θ₀: initial parameter vector; γ ∈ [0, 1): decoupled weight decay.
1: m₀ ← 0 (Initialize first moment vector)
2: v₀ ← 0 (Initialize second moment vector)
3: t ← 0 (Initialize timestep)
4: while $θ_{t}$ not converged do
5: t ← t + 1
6: $g_{t} \leftarrow \nabla_{θ} f_{t}(θ_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep t)
7: $m_{t} \leftarrow β_{1} \cdot m_{t-1} + (1 - β_{1}) \cdot g_{t}$ (Update biased first moment estimate)
8: $v_{t} \leftarrow β_{2} \cdot v_{t-1} + (1 - β_{2}) \cdot g_{t}^{2}$ (Update biased second raw moment estimate)
9: $\hat{m}_{t} \leftarrow m_{t} / (1 - β_{1}^{t})$ (Compute bias-corrected first moment estimate)
10: $\hat{v}_{t} \leftarrow v_{t} / (1 - β_{2}^{t})$ (Compute bias-corrected second raw moment estimate)
11: $θ_{t} \leftarrow θ_{t-1} - α \cdot (\hat{m}_{t} / (\sqrt{\hat{v}_{t}} + ɛ) + γ \cdot θ_{t-1})$ (Update parameters)
12: end while
13: return $θ_{t}$ (Resulting parameters)
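A single iteration of the loop in Table 1 (lines 5 to 11) can be sketched in plain Python on flat lists of floats. This is an illustration of the update rule as listed, not the authors' released optimizer; the function name `lmbert_adam_step` is hypothetical, the defaults are the values stated in the caption, and the toy objective f(θ) = θ² is an assumption chosen so that the gradient is easy to check by hand.

```python
import math


def lmbert_adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9,
                     beta2=0.999, eps=1e-8, gamma=0.02):
    """One update following lines 5-11 of Table 1 (element-wise)."""
    t += 1
    # Lines 7-8: biased first and second moment estimates.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Lines 9-10: bias correction (the part Bert_Adam omits).
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Line 11: Adam step plus the gamma * theta attenuation term.
    theta = [th - alpha * (mh / (math.sqrt(vh) + eps) + gamma * th)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v, t


# Toy objective f(theta) = theta^2, so the gradient is 2 * theta.
theta0 = [1.0]
grad0 = [2.0 * theta0[0]]

# With gamma = 0 the step reduces to standard Adam.
adam_theta, _, _, _ = lmbert_adam_step(theta0, grad0, [0.0], [0.0], 0, gamma=0.0)
# With the caption's gamma = 0.02 the parameter is additionally shrunk.
lm_theta, m1, v1, t1 = lmbert_adam_step(theta0, grad0, [0.0], [0.0], 0)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so standard Adam moves θ by roughly α, while the γ term removes a further α·γ·θ.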

4 Experiments and analyses

This section introduces the experimental environments and corpora and conducts detailed experiments and analyses. These experiments prove that LightMobileBert has reduced the number of parameters and achieves relatively higher performance than other lightweight models.

4.1 Experimental environment

The experiments are conducted on two systems running Windows 10. The first system uses an Intel(R) Xeon(R) Gold 6154 CPU with 256GB memory, an Nvidia TITAN V graphics card, Python version 3.6.8, and Torch version 1.1.1. On this system, we mainly train the large corpora. The second system has an Intel(R) Xeon(R) E5-1620 v3 3.5 GHz CPU with 8GB memory, an NVIDIA RTX 2080 graphics card, Python version 3.6.8, and Torch version 1.1.1. This system is used for training the small corpora.

4.2 Corpora composition

Following BERT and MobileBert, we do not consider the controversial WNLI corpus. We use accuracy as the metric for SST-2, QNLI, QQP, RTE, MNLI-m, and MNLI-mm. CoLA is evaluated with the Matthews correlation coefficient, and MRPC is evaluated with F1. In this paper, unless explicitly mentioned otherwise, the experimental data is the default development set of GLUE.

Table 2
GLUE classification corpora

Corpora	Train size	Task	Domain	Evaluation metric	Classes
RTE	2.5k	textual entailment	news, wikipedia	accuracy	2
MRPC	3.7k	paraphrase	news	accuracy and F1	2
CoLA	8.5k	linguistic correctness	misc.	Matthews correlation	2
SST-2	67k	sentiment analysis	movie reviews	accuracy	2
QNLI	105k	textual entailment	wikipedia	accuracy	2
QQP	364k	paraphrase	online QA	accuracy and F1	2
MNLI	393k	textual entailment	misc.	accuracy	3
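For reference, the three evaluation metrics named in Table 2 can be computed for a binary task as in the sketch below. The function names and the toy predictions are illustrative assumptions; for the GLUE results reported in this paper, the benchmark's own evaluation scripts remain the authority.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred):
    tp, _, fp, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy predictions (illustrative only).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
```

Unlike accuracy, the Matthews correlation coefficient ranges from -1 to 1 and stays informative on imbalanced label distributions, which is why it is the customary metric for CoLA.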

4.3 Effect of batch size on model performance

Due to the significant difference in the size of each corpus, setting a uniform batch size is not conducive to the performance of the LightMobileBert model. Therefore, this section explores the optimal batch size corresponding to the different corpora.

A fixed initial learning rate and training batch are set for each corpus. The learning rate remains unchanged for the first third of the total training steps. For the final two-thirds of the training steps, the learning rate decreases linearly until it reaches one-third of the initial learning rate: $lr_{step} = \begin{cases} lr & (0 < step ⩽ \frac{1}{3} total) \\ \frac{4}{3} lr - \frac{lr}{total} \cdot step & (\frac{1}{3} total < step ⩽ total) \end{cases}$ (6) where lr is the initial learning rate, total is the total number of training steps, step is the current number of training steps, and lr_step is the learning rate under the current number of steps. The relevant hyper-parameters are presented in Table 3, and the test performance of each corpus under training in different batch sizes is recorded (Fig. 4).
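The schedule described in words above (constant for the first third of training, then a linear decrease that ends at one-third of the initial rate) can be sketched as follows. The function name `learning_rate` and the step counts in the example are assumptions made for illustration.

```python
def learning_rate(step, total, lr):
    """Piecewise schedule: hold lr, then decay linearly to lr / 3.

    step  : current training step, 0 < step <= total
    total : total number of training steps
    lr    : initial learning rate
    """
    hold = total / 3
    if step <= hold:
        return lr
    # Slope -lr/total takes the rate from lr at step = total/3
    # down to lr/3 at step = total.
    return lr - (lr / total) * (step - hold)


# Example with the initial rate used for the larger corpora in Table 3.
initial = 8e-5
schedule = [learning_rate(s, 300, initial) for s in (1, 100, 200, 300)]
print(schedule)
```

The decay segment is continuous with the constant segment at step = total/3, so the learning rate never jumps during training.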

Table 3

Setting of hyper-parameters corresponding to various corpora in the model

Corpora	Initial learning rate	Train batch size	Batch size
RTE	7×10^-5	100	4, 6, 8, 16, 32, 64, 96
MRPC	7×10^-5	50	4, 6, 8, 16, 32, 64
CoLA	8×10^-5	50	8, 16, 32, 64, 96, 120
SST-2	8×10^-5	15	16, 32, 64, 96, 120
QNLI	8×10^-5	12	16, 32, 64, 96, 120
QQP	8×10^-5	5	16, 32, 64, 96, 120
MNLI	8×10^-5	5	16, 32, 64, 96, 120

Fig. 4

Effects of batch sizes on model performance.

Figure 4 highlights that: (1) LightMobileBert obtains the best performance with different batch sizes on different corpora. This is caused by the difference in the size of the corpora and the difficulty of the corpora itself. (2) The smaller the corpus size, the smaller the optimal batch size required. It is proved that under the same model conditions, the relatively small batch size of a relatively small corpus can promote the model to learn subtler corpus characteristics, which is convenient for the optimization of the neural network model. (3). As the size of the corpora increases, the influence of the batch size on model performance gradually decreases to a specific range. This is because the larger the corpora, after a certain period of training, the characteristics of acquired corpora improved (no over-fitting and under-fitting problems), thereby the effects of different batch sizes on model performance become smaller. The experiment mainly demonstrated that the smaller the corpora, the greater the impact of the change in batch size on model performance.

4.4 Comparison of LightMobileBert with existing models

This section compares the test results obtained under the best batch size of Section 4.3 against the performance of currently popular lightweight models. The corresponding results are reported in Table 4.

Table 4
Performance of different models on GLUE

Models	#Params	RTE	MRPC	CoLA	SST-2	QNLI	QQP	MNLI	Average
Bert_Base [1]	110M	71.1	89.5	54.3	92.7	91.2	89.8	82.2(average)	81.7
DistillBert [15]	66.4M	59.9	87.5	51.3	92.7	89.2	88.5	82.2/-	–
TinyBert [16] (Test set)	14.5M	66.6	86.4	44.1	92.6	87.7	71.3	82.5/81.8	76.63
BARTen-yT [13]	17M	62.8	79.2	44.4	94.4	88.2	85.5	81.6/83.0	77.39
BERT-of-Theseus [7]	67M	68.2	89.0	51.1	91.5	89.5	89.6	82.3(average)	80.20
YOCO-Bert [14]
Searched_A	20M-40M	65.0	81.2	15.2	84.3	68.9	88.8	71.8(average)	68.38
Searched_B	40M-60M	69.3	88.5	55.6	92.1	85.1	89.9	81.7(average)	80.49
MobileBert [19]	25.3M	70.4	88.8	51.1	92.6	91.6	70.5	84.3/83.4	79.09
LightMobileBert_basic	14.47M	53.79	87.48	40.66	90.48	86.88	89.83	81.20/81.59	76.49
LightMobileBert	14.47M	70.04	90.05	54.42	91.51	88.16	90.31	82.31/82.43	81.15

According to Table 4: (1) The average performance of LightMobileBert on the seven corpora is 2.06 percentage points higher than that of MobileBert, while its parameter count is only 57% of MobileBert’s. This indicates a clear advantage over MobileBert. (2) LightMobileBert has almost the same number of parameters as TinyBert and BARTen-yT, yet its average performance is 4.52 and 3.76 percentage points higher, respectively. (3) Compared with larger models such as YOCO-Bert (Searched_B) and BERT-of-Theseus, LightMobileBert is much smaller and performs slightly better on average. (4) Compared with LightMobileBert_basic, LightMobileBert improves average performance by 4.66 percentage points, which supports the effectiveness of the optimization measures. (5) Compared with Bert_Base, the average performance of our model is slightly lower, but it uses only 13.15% of the parameters of Bert_Base. Taken together, these experiments show that LightMobileBert reaches competitive or higher accuracy with fewer parameters and lower computational cost than other popular models.
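The gaps and parameter ratios quoted above follow directly from the "#Params" and "Average" columns of Table 4. The short sketch below recomputes them; the dictionary and function names are illustrative only.

```python
# Values copied from the "#Params" (in millions) and "Average" columns of Table 4.
params_m = {"Bert_Base": 110.0, "MobileBert": 25.3, "LightMobileBert": 14.47}
average = {
    "Bert_Base": 81.7,
    "TinyBert": 76.63,
    "BARTen-yT": 77.39,
    "MobileBert": 79.09,
    "LightMobileBert_basic": 76.49,
    "LightMobileBert": 81.15,
}


def gain_over(baseline: str) -> float:
    """Difference in average score (LightMobileBert minus baseline)."""
    return round(average["LightMobileBert"] - average[baseline], 2)


def param_share(baseline: str) -> float:
    """LightMobileBert parameter count as a percentage of the baseline's."""
    return round(100.0 * params_m["LightMobileBert"] / params_m[baseline], 2)


if __name__ == "__main__":
    for name in ("MobileBert", "TinyBert", "BARTen-yT", "LightMobileBert_basic"):
        print(f"gain over {name}: {gain_over(name):.2f}")
    for name in ("MobileBert", "Bert_Base"):
        print(f"parameters relative to {name}: {param_share(name):.2f}%")
```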

4.5 Ablation experiments

The advantages of the LightMobileBert model over competing models have been demonstrated above. In this section, we investigate the effects of the loss function, the optimizer, and the CP decomposition on the performance of the LightMobileBert model. The layer reduction operation on Transformers described in Section 3.1 is similar in idea to the BERT-of-Theseus model, so ablation experiments for layer reduction are omitted here. For further details on the related ablation experiments, please refer to [7].

Loss function ablation experiments: These experiments explore the difference between training the LightMobileBert model with the joint loss function and with the CE loss function alone. The optimal batch size and the corresponding hyper-parameters are used for each corpus. The only difference is whether the joint loss function or the CE loss function is used during training and testing. Table 5 reports the highest performance of the LightMobileBert model under each loss function per corpus.

Table 5
Loss function ablation experiments

Corpora		RTE	MRPC	CoLA	SST-2	QNLI	QQP	MNLI	Average
Batch size		6	6	16	64	64	64	32/64	–
Performance	Joint loss	70.04	90.05	54.52	91.51	88.16	90.31	82.31/82.43	81.15
Performance	CE	67.15	89.33	52.80	90.94	88.05	90.00	82.01/82.15	80.30

According to Table 5: (1) On each corpus, the model performs better with the joint loss function than with the CE loss function alone. The average performance with the joint loss function is 0.85 percentage points higher than with the CE loss function, which shows the value of the joint loss function. (2) When the corpus is relatively small, the joint loss function accelerates convergence and improves the model’s performance more than it does on a relatively large corpus. (3) When the corpus is relatively large, the improvement brought by the joint loss function is relatively weak. Based on these conclusions, we use the joint loss function to train LightMobileBert.

Optimizer ablation experiments: Because the LMBert_Adam optimizer contains bias correction and fine-tuning terms, it is necessary to examine how LMBert_Adam and Bert_Adam differ in their effect on the model. We use the joint loss function throughout this group of experiments. The only difference is whether the LMBert_Adam optimizer or the Bert_Adam optimizer is used during training and testing. The corresponding model performance on each corpus is presented in Table 6.

Table 6

Optimizer ablation experiments

Corpora		RTE	MRPC	CoLA	SST-2	QNLI	QQP	MNLI	Average
Batch size		6	6	16	64	64	64	32/64	–
Performance	LMBert_Adam	70.04	90.05	54.52	91.51	88.16	90.31	82.31/82.43	81.15
Performance	Bert_Adam	66.06	89.00	44.02	90.48	87.50	90.00	81.32/81.88	78.78

According to Table 6: (1) On each corpus, the model performs better with the LMBert_Adam optimizer than with the Bert_Adam optimizer. The average performance with the LMBert_Adam optimizer is 2.37 percentage points higher than with the Bert_Adam optimizer, which demonstrates the importance of the former. (2) The LMBert_Adam optimizer improves the model’s performance more effectively on small corpora. (3) Comparing Tables 5 and 6 shows that, as corpus size increases, the joint loss function and the LMBert_Adam optimizer have similar enhancement effects on the LightMobileBert model. Hence, this paper applies the LMBert_Adam optimizer to achieve better performance.
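To make the comparison between Tables 5 and 6 concrete, the sketch below subtracts each ablated row from the full-model row, corpus by corpus. The scores are copied from the two tables. The variable names are illustrative, and the MNLI entry reported as "a/b" is split into two scores labelled "MNLI (1st)" and "MNLI (2nd)".

```python
corpora = ["RTE", "MRPC", "CoLA", "SST-2", "QNLI", "QQP", "MNLI (1st)", "MNLI (2nd)"]

# Full model: joint loss + LMBert_Adam (first "Performance" row of Tables 5 and 6).
full = [70.04, 90.05, 54.52, 91.51, 88.16, 90.31, 82.31, 82.43]
# Table 5: CE loss only.
ce_only = [67.15, 89.33, 52.80, 90.94, 88.05, 90.00, 82.01, 82.15]
# Table 6: Bert_Adam optimizer.
bert_adam = [66.06, 89.00, 44.02, 90.48, 87.50, 90.00, 81.32, 81.88]


def gains(full_row, ablated_row):
    """Per-corpus score difference (full model minus ablated variant)."""
    return [round(f - a, 2) for f, a in zip(full_row, ablated_row)]


loss_gain = gains(full, ce_only)
optimizer_gain = gains(full, bert_adam)

if __name__ == "__main__":
    for name, lg, og in zip(corpora, loss_gain, optimizer_gain):
        print(f"{name:<11} joint loss: +{lg:.2f}   LMBert_Adam: +{og:.2f}")
```

On these numbers, the two largest gains in both ablations fall on RTE and CoLA, which is in line with the observation that both measures help most on the smaller corpora.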

CP decomposition ablation experiments: Applying CP decomposition to the Transformers layers makes the model lighter. To evaluate its impact, we ran experiments with and without CP decomposition while using the joint loss function and the LMBert_Adam optimizer in both cases. The corresponding model performance on each corpus is reported in Table 7.

Table 7

CP decomposition ablation experiments

Corpora		RTE	MRPC	CoLA	SST-2	QNLI	QQP	MNLI	Average
Batch size		6	6	16	64	64	64	32/64	–
Performance	Adopting CP decomposition	70.04	90.05	54.52	91.51	88.16	90.31	82.31/82.43	81.15
Performance	Without CP decomposition	72.20	90.47	55.21	91.17	89.75	90.44	82.93/83.16	81.92

According to Table 7, the average performance without CP decomposition is 0.77 percentage points higher than with CP decomposition, indicating that CP decomposition has a slightly negative influence on the performance of the LightMobileBert model. However, the CP decomposition technique effectively compresses the LightMobileBert model and avoids pre-training the model. Therefore, we adopt CP decomposition in our method.
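As a generic illustration of why a CP (CANDECOMP/PARAFAC) factorization compresses a weight tensor, the sketch below counts the parameters of a three-way tensor stored densely and stored as rank-R factor matrices, and rebuilds a tiny tensor from its factors. The shapes, the rank, and the function names are hypothetical and are not the configuration used for LightMobileBert.

```python
def dense_params(i: int, j: int, k: int) -> int:
    """Number of entries in a dense tensor of shape (i, j, k)."""
    return i * j * k


def cp_params(i: int, j: int, k: int, rank: int) -> int:
    """Entries in the three CP factor matrices of shapes (i, R), (j, R), (k, R)."""
    return rank * (i + j + k)


def cp_reconstruct(a, b, c):
    """Rebuild T[x][y][z] = sum_r a[x][r] * b[y][r] * c[z][r] from CP factors."""
    rank = len(a[0])
    return [
        [
            [sum(a[x][r] * b[y][r] * c[z][r] for r in range(rank)) for z in range(len(c))]
            for y in range(len(b))
        ]
        for x in range(len(a))
    ]


if __name__ == "__main__":
    # Hypothetical shape and rank, chosen only to show the scale of the saving.
    shape, rank = (12, 128, 512), 64
    print("dense:", dense_params(*shape), "CP:", cp_params(*shape, rank))

    # Tiny rank-2 example: factor matrices for a 2 x 2 x 1 tensor.
    a = [[1, 0], [0, 1]]
    b = [[1, 2], [3, 4]]
    c = [[1, 1]]
    print(cp_reconstruct(a, b, c))
```

The factorized form grows with the sum of the dimensions rather than their product, which is where the compression comes from. The price is that a low-rank tensor only approximates the original weights, which matches the small accuracy drop reported in Table 7.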

5 Conclusion

This paper proposes the LightMobileBert model, an extension of MobileBert. It first applies the layer reduction operation on Transformers to reduce the model size. CP decomposition is then adopted to reduce the model’s size further and to avoid pre-training the model. After that, the improved SCL and the CE functions are combined into a joint loss function to enhance the performance of the LightMobileBert model. Finally, the LMBert_Adam optimizer is constructed on the basis of the Bert_Adam optimizer to improve performance further. The experimental results on the seven classification corpora of GLUE demonstrate that, compared with MobileBert, LightMobileBert’s overall performance is improved and the model size is significantly reduced. These results support the effectiveness of the proposed model and provide a theoretical basis and practical reference for deploying lightweight models on edge devices.

However, some problems remain to be solved. First, this paper only explores classification and does not cover more complicated problems such as question answering (QA). Second, instead of designing and pre-training a new model, LightMobileBert is obtained directly by reducing the number of Transformer layers and applying tensor decomposition, which causes a certain amount of performance loss. Regarding the first problem, the improved SCL and CE loss functions are suited to classification algorithms, so studying a more general loss function is an interesting direction. Regarding the second problem, designing and pre-training a new model requires substantial hardware and software resources, which is challenging for general research institutions and individuals. Therefore, exploring a method that reduces the size of a model and its corresponding pre-trained network with only a small performance loss is meaningful.

Acknowledgments

This research was supported through National Natural Science Foundation of China (No. 62273219, 62006149, 62003203, 62102239, 61862001); Natural Science Foundation of Shaanxi Province (No. 2021JM-206, 2021JQ-314); Fundamental Research Funds For the Central Universities (No.2021CSLY023, 2021TS035, GK202205038); Center for Applied Mathematics of Inner Mongolian (ZZYJZD2022003); the Shaanxi Key Science and Technology Innovation Team Project (No. 2022TD-26).

References

1. Devlin, Chang M.W., Lee and Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2019), 4171–4186.
2. Chen, Ma, Wei, Zhu, Ma, Gong and Zhou, A text-based multi-span network for reading comprehension, Journal of Intelligent & Fuzzy Systems 41 (2021), 5807–5819.
3. Chen, Ma, Wei, Ma and Zhu, MTQA: Text-based multitype question and answer reading comprehension model, Computational Intelligence and Neuroscience 2021 (2021), 1–12.
4. Gordon M.A., Duh and Andrews, Compressing bert: Studying the effects of weight pruning on transfer learning, International Conference on Learning Representations (2020).
5. Guo F.M., Liu, Mungall F.S., Lin and Wang, Reweighted proximal pruning for large-scale language representation, International Conference on Learning Representations (2020).
6. Fan, Grave and Joulin, Reducing transformer depth on demand with structured dropout, arXiv preprint arXiv:1909.11556 (2019).
7. Zhou, Ge, Wei and Zhou, Bert-of-theseus: Compressing bert by progressive module replacing, Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020), 7859–7869.
8. Zafrir, Boudoukh, Izsak and Wasserblat, Q8bert: Quantized 8bit bert, In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS) (2019), 36–39.
9. Shen, Dong, Ye, Ma, Yao, Gholami and Keutzer, Q-bert: Hessian based ultra low precision quantization of bert, In Proceedings of the AAAI Conference on Artificial Intelligence (2020), 8815–8821.
10. Zhang, Hou, Yin, Shang, Chen, Jiang and Liu, Ternarybert: Distillation-aware ultra-low bit bert, The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020), 509–521.
11. Liu, Li and Cheng, Hardware acceleration of fully quantized bert for efficient natural language processing, In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2021), 513–516.
12. Lan Z.Z., Chen, Goodman, Gimpel, Sharma and Soricut, ALBERT: A Lite Bert for Self-supervised Learning of Language Representations, International Conference on Learning Representations (2020), 1–17.
13. Liu, An and Qiu, Y-Tuning: An Efficient Tuning Paradigm for Large-Scale Pre-Trained Models via Label Representation Learning, arXiv preprint arXiv:2202.09817 (2022).
14. Zhang, Zheng, Yang, Li, Wang, Chao and Ji, You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient, arXiv preprint arXiv:2106.02435 (2021).
15. Sanh, Debut, Chaumond and Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, The 2020 Conference on Empirical Methods in Natural Language Processing (2020), 38–45.
16. Jiao, Yin, Shang, Jiang, Chen, Li and Liu, Tinybert: Distilling bert for natural language understanding, The 2020 Conference on Empirical Methods in Natural Language Processing (2020), 2563–2576.
17. Chen, He, Hui and Sun, Simplified tinybert: Knowledge distillation for document retrieval, In European Conference on Information Retrieval (2021), 241–248.
18. Lee, Saxe and Harang, CATBERT: Context-aware tiny BERT for detecting social engineering emails, arXiv preprint arXiv:2010.03484 (2020).
19. Sun, Yu, Song, Liu, Yang and Zhou, Mobilebert: a compact task-agnostic bert for resource-limited devices, International Conference on Learning Representations (2020).
20. Chen, Kornblith, Norouzi and Hinton, A simple framework for contrastive learning of visual representations, In International Conference on Machine Learning (PMLR) (2020), 1597–1607.
21. Gunel, Du, Conneau and Stoyanov, Supervised contrastive learning for pre-trained language model finetuning, Conference and Workshop on Neural Information Processing Systems (2020).
22. Khosla, Teterwak, Wang, Sarna, Tian, Isola and Krishnan, Supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2021), 18661–18673.
23. Kingma D.P. and Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (2014), 1–15.
24. Wang, Singh, Michael, Hill, Levy and Bowman S.R., GLUE: A multi-task benchmark and analysis platform for natural language understanding, In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (2018), 353–355.
25. Zhang and Liu, Ternary weight networks, arXiv preprint arXiv:1605.04711 (2016).
26. Hou and Kwok J.T., Loss-aware weight quantization of deep networks, In International Conference on Learning Representations (2018).
27. Pennington, Socher and Manning C.D., Glove: Global vectors for word representation, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), 1532–1543.
28. Wei and Zou, Eda: Easy data augmentation techniques for boosting performance on text classification tasks, arXiv preprint arXiv:1901.11196 (2019).
29. Zhang, Wu, Katiyar, Weinberger K.Q. and Artzi, Revisiting few-sample BERT fine-tuning, The International Conference on Learning Representations (2021).
30. Dagan, Glickman and Magnini, The pascal recognising textual entailment challenge, In Machine Learning Challenges Workshop (2005), 177–190.
31. Haim R.B., Dagan, Dolan, Ferro, Giampiccolo, Magnini and Szpektor, The second pascal recognising textual entailment challenge, In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment (2006), 7.
32. Dolan and Brockett, Automatically constructing a corpus of sentential paraphrases, In Third International Workshop on Paraphrasing (IWP2005) (2005).
33. Warstadt, Singh and Bowman S.R., Neural network acceptability judgments, Transactions of the Association for Computational Linguistics 7 (2019), 625–641.
34. Dolan and Brockett, Automatically constructing a corpus of sentential paraphrases, In Third International Workshop on Paraphrasing (IWP2005) (2005).
35. Levesque, Davis and Morgenstern, The winograd schema challenge, In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning (2012).
36. Rajpurkar, Zhang, Lopyrev and Liang, SQuAD: 100,000+ questions for machine comprehension of text, Conference on Empirical Methods in Natural Language Processing (2016), 2383–2392.
37. Chen, Zhang and Zhao, Quora question pairs, University of Waterloo (2018), 1–7.
38. Williams, Nangia and Bowman S.R., A broad-coverage challenge corpus for sentence understanding through inference, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Long Papers (2018).
