Abstract
Vision-based Continuous Sign Language Recognition (CSLR) is a challenging, weakly supervised task that aims to segment and recognize sign language from weakly annotated image stream sequences. Compared with Isolated Sign Language Recognition (ISLR), its biggest challenge is that the image stream sequences have ambiguous temporal boundaries. Recent CSLR work has shown that visual-level sign language recognition hinges on image stream feature extraction and feature alignment, and that overfitting is the most critical problem in the CSLR training process. After investigating advanced CSLR models from recent years, we identify adequate training of the feature extractor as the key to the task. Therefore, this paper proposes a CSLR model with Multi-state Feature Optimization (MFO), based on a Fully Convolutional Network (FCN) and Connectionist Temporal Classification (CTC). The MFO mechanism supervises the multiple states of each Sign Gloss during modeling and provides more refined labels for training the CTC decoder, which can effectively solve the overfitting problem that arises during training while also significantly reducing the training time. We validate the MFO method on a popular CSLR dataset and demonstrate that the model achieves better performance.
Introduction
Sign language is a visual language consisting of gestures, body movements and facial expressions governed by the rules of sign language grammar, and it is the main way for deaf people to communicate [7, 32]. In linguistic terms, the Gloss is currently used as the basic unit of sign language, and each Gloss consists of one or several successive gestures [29].
Sign language recognition (SLR) is divided into isolated sign language recognition (ISLR) and continuous sign language recognition (CSLR), both of which convert sign language images or videos into annotation sequences [28]. ISLR data have marked boundaries between gestures and their annotations. In contrast, CSLR data only give the temporal order of the gloss sequence without any segmentation information, which further increases the difficulty of the task. At this stage, continuous sign language datasets are limited by production costs, and sentence-level annotations are usually used. In [14, 27], CSLR based on these datasets is defined as a weakly supervised problem.
To address these issues in CSLR tasks, researchers use deep learning networks to process video-based CSLR. These networks usually consist of three parts: a feature extraction module, a sequence learning module and an alignment module, as in [33, 34]. The feature extraction module extracts visual features from image sequences or video frames, and the sequence learning module further mines the context information between consecutive sequences, extracts relevant features and synthesizes sign language Gloss sequences. However, when extracting context information, the sequence learning module may over-exploit the extracted visual features and overfit during training, because the length of the Gloss sequence is not dynamically resolved. Therefore, the alignment module is introduced to supervise the Gloss sequence: it searches for possible alignments between the Gloss sequence and the corresponding labels to reduce the adverse effects of overfitting. In addition to this representative design, networks based on the Transformer, such as the Vision Transformer (ViT), are also popular for CSLR tasks, even though the Transformer was originally designed for Natural Language Processing (NLP) tasks [15, 17]. Camgoz et al. [19] proposed an encoder-decoder network and Daras et al. [10] proposed a Generative Adversarial Network (GAN), both of which significantly outperformed the previous state-of-the-art performance based on the Transformer. In addition, sign language key points are regarded as an important direction of CSLR research. Bianco et al. [26] used human key points extracted from videos to train a model on the LSA-T dataset, which is the first Argentine continuous sign language dataset. Image preprocessing operations are also widely used for feature enhancement to further improve the recognition performance of the model [1]. Kumar et al. [16] also proposed a data encryption algorithm to improve the robustness of image data.
To train the designed model, the Connectionist Temporal Classification (CTC) loss function is usually employed to search for the alignment between feature sequences and the corresponding labels. However, because of its highly complex structure, an end-to-end deep network may be inadequately trained or may overfit during training [27, 34]. Some studies focus on solving this problem. The Gloss Feature Enhancement (GFE) module [14] was proposed to further supervise the Gloss features. Supervised training has also been performed using iteratively generated pseudo-labels [13, 27]. Optimizing with the CTC loss function alone also introduces the spiking phenomenon [2, 8], which not only reduces the correlation between consecutive frames, but also causes the visual module to lose the ability to discriminate other frames as a result of over-biasing toward a few keyframes. In [33], the Visual Alignment Constraint (VAC) mechanism is proposed to address the spiking phenomenon.
However, even though research on sign language recognition has achieved good results, certain problems remain for the CSLR task. Since video-based CSLR is characterized by an asymmetry between labels and sequences, recognizing a small number of labels requires training on extensive data, which is extremely time-consuming; the weak supervision also makes accurate recognition difficult to ensure. In this paper, we propose the Multi-state Feature Optimization (MFO) mechanism for the CSLR task to address these problems. It not only significantly reduces the time cost of training, but also strengthens supervision over the multiple states of Sign Glosses, thereby weakening the effect of overfitting and achieving better performance.
The main contributions of this work are summarized as follows:
The adopted model is an end-to-end fully convolutional network for the CSLR task that focuses on a single Sign Gloss rather than the entire sentence, which is more consistent with the semantic composition rules of sign language;
Compared with training methods such as iterative training and the introduction of additional supervision modules, our method strengthens the supervision of multiple states of Sign Glosses, which not only reduces the network scale, but also significantly improves the convergence speed and final performance of the model. To a certain extent, the contribution allocation problem of the end-to-end model is considered as well.
Related work
The CSLR task is divided into two stages. The first stage represents the input image sequence, and the second stage learns the spatio-temporal dependencies between the feature representation sequences.
In early research, the feature sequence was modeled with the Hidden Markov Model (HMM) [11, 23, 31]. In [21], frame-level supervision is achieved by a Graph Neural Network-Hidden Markov Model (GNN-HMM), and an iterative expectation maximization method is proposed to address the CSLR task. A hybrid approach of HMM and Convolutional Neural Networks (CNN) is adopted to simulate Gloss transformation in [23, 25].
With the rapid growth of computing power, deep learning network architectures have come to play a much greater role. The CNN-HMM network is combined with Long Short-Term Memory (LSTM) and extended to a CNN-LSTM-HMM architecture to perform more fine-grained processing of the SLR task, as proposed in [20, 23]. Some researchers used LSTM to perform two-level feature refinement to enhance the learning of associations between sequences [14, 35]. The CTC loss proposed in [3] has also become a hot topic in CSLR research in recent years. End-to-end networks with the CTC loss are used for sequence alignment to further optimize model performance [13, 18, 27]. Attention networks combining CNN with Transformer also achieved state-of-the-art performance with CTC loss optimization in [10, 19]. Some researchers have gradually turned from 2D-CNN to 3D-CNN for the CSLR task. In [12], the authors proposed a 3D-CNN-based Hierarchical Attention Network (HAN) for sign language recognition. A network architecture combining 3D-CNN with LSTM and the CTC loss is proposed in [30]. Yang et al. [35] even used a mixture of 2D-CNN and 3D-CNN as feature extractors.
Nevertheless, some research found that the CTC loss limits the discriminative ability of the feature extractor in end-to-end models. Iterative training has been proposed to further optimize the model [13, 27], as has the introduction of additional optimization modules for supervision [14]. Furthermore, new alignment constraints have been proposed to mitigate the spiking phenomenon of the CTC loss, which also significantly improve model performance [33].
Based on the above research, we identify what we consider the essence of model optimization and propose a more efficient optimization method. The Multi-state Feature Optimization (MFO) model proposed in this paper for CSLR reflects this essence to a certain extent. The method takes the primary and secondary relationships between module optimizations into account: multiple states of Sign Glosses are jointly optimized with certain weights, rather than optimizing only end to end. It enhances the overall performance of the model by optimizing states at different stages, and greatly cuts the time cost of training.
Methodology
In this section, the proposed MFO model is described in detail. The processing of the model is summarized in two steps. In the first step, visual features are extracted from the given RGB image stream sequence $X = \{x_1, x_2, \cdots, x_T\}$ with $T$ frames. In the second step, the extracted visual features are converted into the sign language sequence $Y = \{y_1, y_2, \cdots, y_Z\}$ with target sequence length $Z$. Together, the two steps form the mapping $R: X \rightarrow Y$ from the image stream sequence $X$ to the target sequence $Y$.
Our proposed model goes further in two aspects. In the feature extractor, we follow the design idea of current CSLR research and divide the visual feature extractor into a frame feature extractor and a Gloss feature encoder. The frame feature extractor performs spatial encoding on a single image, and the Gloss feature encoder fuses the spatially encoded features into Gloss features according to the semantic composition of sign language, which not only makes full use of visual features, but also strengthens the correlation between features. However, compared with the work in [33, 34], in addition to frame spatial feature processing we also add Gloss spatial feature extraction to the Gloss feature encoder, which is described in detail in Section 3.2.
In terms of feature optimization, we introduce the MFO mechanism into the model, considering the contribution allocation problem of the end-to-end model. Unlike [14, 20, 24], which focus only on the final state, the MFO mechanism not only pays attention to the final state of the end-to-end model, but also strengthens the supervision of each state in the model to reduce the adverse effect of different states on the final result. The model framework used in this paper is shown in Fig. 1. The MFO mechanism is described in detail in Section 3.4.

Overview of the proposed framework. The network is fully convolutional and consists of two components: the frame feature extractor module and the Gloss feature encoder module. A CTC decoder is used as the network decoder to align the gloss sequence with the sign language sequence. The change in data dimensions at each step is marked below it.
The frame feature extractor F encodes the spatial features of the input video frame sequences. It consists of Convolutional Neural Networks for extracting the spatial features of the frames and max-pooling layers for compressing the spatial features, with residual blocks for extracting the context information of the features. Each group of image stream sequences is a tensor of shape (t, c, h, w), and the network processes multiple groups of tensors at a time, i.e., a tensor of shape (B * t, c, h, w), where t is the number of frames, c is the number of channels of a frame, h is the height of a frame, w is the width of a frame, and B is the batch size of each network input. Note that, before each input, we pad each group of image sequences to a fixed length E with all-zero tensors to meet the batch processing requirements of the model without changing the original data content, i.e., (t, c, h, w) → (E, c, h, w).
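The zero-padding step described above can be illustrated with a minimal Python sketch. For brevity it assumes that each frame is flattened into a list of c·h·w values and that the zero frames are appended at the end of the sequence; neither detail is stated in the text, and the function names `pad_sequence` and `flatten_batch` are hypothetical.

```python
def pad_sequence(frames, target_len, frame_size):
    """Pad a list of flattened frames with all-zero frames up to target_len (E)."""
    if len(frames) > target_len:
        raise ValueError("sequence is longer than the fixed length E")
    # Assumption: zero frames are appended after the original frames.
    padding = [[0.0] * frame_size for _ in range(target_len - len(frames))]
    # The original frames are copied unchanged.
    return [list(frame) for frame in frames] + padding


def flatten_batch(batch):
    """Merge B padded sequences of E frames into one list of B * E frames."""
    return [frame for sequence in batch for frame in sequence]


# Two toy sequences with t = 2 and t = 3 frames, each frame holding 4 values.
seq_a = [[1.0] * 4, [2.0] * 4]
seq_b = [[3.0] * 4, [4.0] * 4, [5.0] * 4]
E = 3
batch = [pad_sequence(seq, E, frame_size=4) for seq in (seq_a, seq_b)]
frames = flatten_batch(batch)  # B * E = 2 * 3 = 6 frames
```

Here the shorter sequence receives one all-zero frame, so both sequences have length E and can be stacked into a single batch.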
The frame feature extractor independently processes each frame for frame spatial feature learning without changing the number of original frames. The process is described as:
In our research, we found that the biggest problem of the CSLR task is the inability to determine the temporal boundaries needed to accurately divide the Sign Glosses. To solve this problem, we propose a Gloss feature encoder G to learn the associations between frame feature sequences. The Gloss feature encoder G is divided into a temporal encoder and a spatial encoder. The temporal encoder constructs the Gloss feature sequence by learning the temporal correlation between different frames, and the spatial encoder enhances the Gloss features through the context information between the Gloss sequences. The Gloss feature encoder is shown in Fig. 2.

Overview of the Gloss feature encoder module. N denotes the number of modules.
The temporal encoder in this paper is composed of 1D-CNNs and max-pooling (MP) layers that operate on the spatial features along the temporal dimension. Compared with the work in [23, 33], our temporal encoder operates only with 1D-CNNs in the temporal dimension, without a conventional temporal feature encoder, which makes its structure more lightweight. After the 1D-CNNs, a max-pooling layer processes the temporal features to change the temporal dimension appropriately.
Following the temporal encoder, the spatial encoder contains only a single 1D-CNN module, which fuses the spatial context information between Gloss sequences without affecting the temporal dimension information. Note that although the temporal encoder and the spatial encoder are logically connected directly, in the actual model the temporal encoder outputs a data tensor of shape (B * E, c), whereas the spatial encoder requires an input tensor of shape (B, E, c).
The Gloss feature encoder G can be viewed as a sliding window over the frame spatial feature vectors along the temporal dimension. The window size is adjusted through the receptive field of the 1D-CNNs, and the sliding step is adjusted through their stride. The overall operation of the Gloss feature encoder is described as:
After encoding the Gloss features, our model introduces a CTC decoder D to automatically align the Gloss sequences with labels of sign language. The process takes into account all possible matching probabilities between the Gloss sequences and the labels, without the need for detailed annotation of the boundaries of the Gloss sequence.
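To make "all possible matching probabilities" concrete, the following Python sketch computes the CTC probability of a label sequence with the standard forward (dynamic programming) recursion and checks it against a brute-force enumeration of all frame-level paths. It illustrates the general CTC formulation rather than the authors' code; the function names, the blank index 0 and the toy probabilities are assumptions made for illustration.

```python
from itertools import product


def collapse(path, blank=0):
    """Apply the CTC mapping: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def ctc_probability(probs, target, blank=0):
    """Sum the probabilities of all alignments of `target` (forward recursion).

    probs[t][k] is the probability of label k at time step t.
    """
    ext = [blank]
    for label in target:
        ext += [label, blank]  # blank-extended target of length 2 * len(target) + 1
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s in range(len(ext)):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def brute_force_probability(probs, target, blank=0):
    """Reference value: enumerate every path and keep those that collapse to target."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path, blank) == list(target):
            p = 1.0
            for t, label in enumerate(path):
                p *= probs[t][label]
            total += p
    return total


# Toy input: 4 time steps over the vocabulary {blank, 1, 2}.
toy_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.1, 0.4], [0.3, 0.2, 0.5]]
p_forward = ctc_probability(toy_probs, [1, 2])
p_brute = brute_force_probability(toy_probs, [1, 2])
```

The two values agree, which shows that the recursion sums over every alignment without requiring any annotation of Gloss boundaries.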
All vocabulary items in the Gloss sequence are represented as $V = \{v_1, v_2, \cdots, v_u\}$ in the CTC decoder. In addition, a blank tag, denoted as 'blank', is added to $V$; together they constitute the final extended vocabulary list $V \cup \{\text{blank}\}$.
In this paper, a fully connected layer (fc) is used after the Gloss feature encoder to transform each Gloss feature $g_i$ into $g'_i$ with $u + 1$ dimensions, and the softmax activation function is used to transform the Gloss features $\{g'\}_{K \times B}$ into a prediction space $\{z\}_{K \times B}$. Finally, the CTC objective function is calculated to obtain a sign language sequence $Y$ with a final length of $Z$. The process is expressed as:
Revisiting optimization methods in CSLR
In CSLR, the CTC function has shown excellent performance as a loss function for aligning with the ground-truth sequence. However, it produces a series of spike responses [3, 33], so the network tends to predict blank labels when it is uncertain about the boundaries of Glosses. As a result, the final prediction consists of a few non-blank keyframes and many high-confidence blank frames.
To weaken this effect, pseudo-labels generated by iterative training are adopted to correct the errors caused by the spike responses [8, 13]. Additional decoding modules are introduced to supervise the CTC decoder [14, 34]. In [10], a Generative Adversarial Network (GAN) is used to adjust the prediction results. These methods achieve strong performance through supervised optimization of the final state, but they also introduce some problems: iterative training introduces new errors, and additional decoding modules increase the network size and training cost. Besides, the contribution allocation problem of the end-to-end model remains unimproved in these methods, and even worsens. In view of these problems, we propose the Multi-state Feature Optimization (MFO) mechanism.
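The spike pattern can be illustrated with a small greedy (best-path) decoding sketch in Python: most frames are predicted as blank, and only a few keyframes carry a Gloss label. The per-frame predictions and gloss names below are invented for illustration, and greedy decoding is a simplification chosen here, not necessarily the decoding used in this work.

```python
def greedy_ctc_decode(frame_labels, blank="blank"):
    """Best-path CTC decoding: merge repeated labels, then remove blanks."""
    decoded, previous = [], None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Hypothetical per-frame argmax labels showing a spiky prediction.
spiky = ["blank", "blank", "RAIN", "blank", "blank", "blank",
         "TOMORROW", "TOMORROW", "blank", "blank"]
glosses = greedy_ctc_decode(spiky)
blank_ratio = spiky.count("blank") / len(spiky)
```

In this toy sequence only three of ten frames contribute to the decoded Glosses, so the remaining frames receive little direct supervision from the CTC objective.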
Loss function design
The model proposed in this paper contains two feature processing modules: the Frame feature extractor and the Gloss feature encoder. To enhance the visual-spatial features, we add an auxiliary classifier fc to each of the two visual feature states for joint supervision, which optimizes the local features. Inspired by the weighted combination of reviews and ratings in [16], we set different weights for the different modules. The main loss function $L_{main}$ is therefore defined as:
L1 and L2 are the optimization functions for the states of the Frame feature extractor and the Gloss feature encoder, respectively, and L4 is the optimization function of the final state. Because both process states are independent of the final state, the optimizer may over-optimize a certain state. Therefore, we adopt the KLDiv function ($L_{KLD}$) to supervise the front state ($s_1$) and the rear state ($s_2$) and keep the model robust; $L_{KLD}$ is treated as L3. The equation for the equilibration process is defined as:
To be compatible with the CTC objective function adopted in the final state, the CTC objective function is also taken as the optimization function in the process states; that is, the CTC function is applied to L1, L2 and L4. The loss function design of the model is shown in Fig. 3, and the final loss function Loss is expressed as:

Loss design of multi-state feature optimization.
In summary, the $L_{main}$ loss supervises the front and rear states of the model to make full use of visual features. In the $L_{KLD}$ loss, the final state of the network and the state of the Frame feature extractor are regarded as teacher and student, respectively, for knowledge distillation, which corrects possible misalignment between the two classifiers. With the help of both losses, the network enhances its features by remaining compatible with multiple state information.
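One way these pieces could fit together is sketched below in Python. The equations for the combined loss are not reproduced in this text, so the exact form is an assumption: the sketch takes a weighted sum of the two process-state losses, the final-state loss and a KL-divergence term that pulls the student distribution toward the teacher distribution. The function names, the mapping of weights to terms and the toy numbers are all hypothetical.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete probability distributions."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher, student) if p > 0.0)


def total_loss(l1, l2, l4, l_kld, w1, w2, w_kld):
    """Assumed combination: weighted process-state losses plus final and KL terms."""
    l_main = w1 * l1 + w2 * l2 + l4  # assumed form of the main loss
    return l_main + w_kld * l_kld    # assumed form of the final loss


# Toy per-frame class distributions (teacher: final state, student: frame state).
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
l_kld = kl_divergence(teacher, student)

# Placeholder CTC loss values; the weight assignment is a guess, not the paper's Eq. (10).
loss = total_loss(l1=2.0, l2=1.5, l4=1.0, l_kld=l_kld, w1=0.5, w2=0.5, w_kld=5.0)
```

The KL term is zero when the two classifiers agree and grows as their predictions diverge, which is what lets it act as a consistency constraint between states.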
Dataset
RWTH-PHOENIX Weather 2014 (RWTH) [22] is a widely used CSLR dataset built from recordings of daily German TV news weather forecasts interpreted in sign language between 2009 and 2011. The dataset contains only German sign language, performed by nine hearing sign language interpreters; all videos were recorded at 25 FPS with a frame resolution of 210×260. In terms of data volume, it contains 1232 distinct words and a total of 6841 distinct sentences (about 80,000 words). The dataset is explicitly divided into 5672 training samples, 540 validation samples, and 629 testing samples.
Evaluation metrics
To evaluate the performance of our model, we use the Word Error Rate (WER), which is commonly used in continuous sign language recognition. WER is calculated as:
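In its standard form, WER is the minimum number of substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of words in the reference. A minimal Python sketch of this standard definition follows; the function name and the example glosses are illustrative and not taken from the original work.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # d[i][j] is the edit distance between reference[:i] and hypothesis[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i reference words
    for j in range(cols):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1] / len(reference)


# One substitution (B -> X) and one deletion (D) against four reference words.
wer = word_error_rate("A B C D".split(), "A X C".split())
```

The function returns a fraction; multiplying by 100 gives the percentage form in which WER is usually reported.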
The main network setting used in our experiments is shown in Fig. 4.

The detailed setting of the proposed network.
Frame feature extractor. Frequent changes in sign language gestures, illumination, and background affect sign language recognition. ResNet18 is therefore adopted as the Frame feature extractor, as its residual blocks obtain more effective contextual spatial feature information.
Gloss feature encoder. Taking into account the frame rate of the dataset and the average number of frames contained in a single Gloss, the Gloss feature encoder is designed in two parts. The first part uses sliding windows to extract frame spatial features along the temporal dimension to form Gloss sequences, using two k3-s1-p2-m2 structures. The second part employs the attention mechanism to learn the correlation of the spatial context, using a k3-s1-p2-fc structure. Here, k, s, and p denote the convolution kernel size, stride, and padding of the CNN, respectively, m denotes max pooling, fc denotes the fully connected layer, and the numbers that follow give the corresponding parameter values.
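As a rough check on how two k3-s1-p2-m2 blocks act as a sliding window, the Python sketch below tracks the sequence length, receptive field and effective stride through such a stack. It assumes that m2 denotes max pooling with kernel size 2 and stride 2 and that the standard convolution output-length formula applies; the helper name `temporal_stack_stats` is hypothetical.

```python
def temporal_stack_stats(num_frames, blocks):
    """Track length, receptive field and stride through conv + max-pool blocks.

    Each block is (kernel, stride, padding, pool). The pooling layer is assumed
    to use kernel size == stride == pool and no padding.
    """
    length, receptive_field, jump = num_frames, 1, 1
    for kernel, stride, padding, pool in blocks:
        # 1D convolution along the temporal dimension.
        length = (length + 2 * padding - kernel) // stride + 1
        receptive_field += (kernel - 1) * jump
        jump *= stride
        # Max pooling along the temporal dimension.
        length = (length - pool) // pool + 1
        receptive_field += (pool - 1) * jump
        jump *= pool
    return length, receptive_field, jump


# Two k3-s1-p2-m2 blocks applied to a 100-frame sequence.
length, receptive_field, stride = temporal_stack_stats(100, [(3, 1, 2, 2)] * 2)
```

Under these assumptions, a 100-frame sequence is reduced to 26 Gloss-level steps, each covering a window of 10 frames that advances by 4 frames (0.4 s and 0.16 s at the 25 FPS of the dataset).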
We initially set the weight parameters in Eq. (10) to 0.5, 0.5, and 5, respectively. The model is trained for 40 epochs with a batch size of 2. It is optimized by the Adam optimizer with an initial learning rate of $10^{-3}$, which is reduced to one-fifth of its value after every 20 epochs.
In this paper, we mainly focus on sign language recognition in which the inputs are RGB video frames. Hence, we only compare our results with previous methods that use solely the RGB modality.
The proposed end-to-end model with the L4 loss is viewed as the Baseline. We compare the experimental results on RWTH with the previous vision-based methods listed in Table 1. The results of the previous methods are all taken from their original papers. The model based on the MFO mechanism achieves the best results, with a WER of 26.7% on the Dev set and 26.1% on the Test set. These results illustrate the effectiveness of the proposed method.
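The learning-rate schedule can be written as a simple step function. The sketch below is plain Python rather than a specific framework scheduler; it assumes zero-indexed epochs and that the rate is multiplied by 0.2 at each 20-epoch boundary. The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.2, step=20):
    """Step schedule: multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rate for each of the 40 training epochs (epochs 0..39).
schedule = [learning_rate(epoch) for epoch in range(40)]
```

With 40 epochs in total, the rate stays at its initial value for the first 20 epochs and at one-fifth of that value for the remaining 20.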
Result comparison on RWTH
In addition to achieving strong performance, another purpose of our method is to reduce the convergence time of network training while maintaining performance. Advanced sign language recognition methods achieve the best performance at the price of a seriously increased convergence time, which hinders the advancement of research. We compare our method with several studies that have performed well in recent years, as shown in Table 2.
Convergence time cost comparison for optimal performance on RWTH
These studies increase the convergence time by introducing additional models or adopting iterative training, even though recognition accuracy is improved. To alleviate this problem, our MFO mechanism optimizes multiple states of the model at each step, not just the final state, which greatly benefits convergence. On the same dataset, the convergence time of our model is as little as half of theirs at equivalent performance.
Because multi-state optimization can make model performance unstable, we conduct ablation experiments on the proposed MFO mechanism with different combinations of optimization functions to further explain the contribution allocation of the end-to-end model. The experimental results are shown in Table 3.
Ablation results of Multi-state Feature Optimization
A sign language sample is selected to simulate a real scene and to show the gap between the experimental results ($\mathrm{HYP}_i$) and the real label (REF), as shown in Fig. 5.

Alignment results of ablation experiments.
It is also worth noting that although adding the L1 loss or the L2 loss brings smaller gains than using the L4 loss alone, adopting the L3 loss achieves further improvement.
Moreover, we consider the networks currently in common use for feature extraction. In the temporal dimension, LSTM and Bi-directional Long Short-Term Memory (Bi-LSTM) with 512 hidden states are also evaluated. The 1D-CNN is designed as described in Section 4.3. Ablation experiments with different combinations of Gloss feature extraction modules are performed to demonstrate the effectiveness of the Gloss feature encoder module proposed in this paper, as shown in Table 4.
Network performance with different Gloss feature encoder design
When we began designing the Gloss feature encoder, we considered the correlation between the temporal and spatial dimensions in the CSLR task. Therefore, the Gloss feature encoder proceeds step by step, with temporal encoding followed by spatial encoding. The experimental results also demonstrate the effectiveness of our model.
From the perspective of model optimization, the methods in common use are iterative optimization and the introduction of additional supervised modules. The former risks introducing new errors and the latter expands the network size, and both increase the time cost of training. Our proposed Multi-state Feature Optimization mechanism avoids these problems. Even though the method introduces factors that cause imbalance in network performance, it achieves strong results in addressing the spike response of CTC and the contribution allocation of the end-to-end model.
Conclusions
In this paper, we propose an end-to-end fully convolutional network for continuous sign language recognition, using ResNet18 as the visual feature extraction model, employing two-stage Gloss feature extraction to further refine feature learning, and adopting CTC as the decoding model. Moreover, the Multi-state Feature Optimization mechanism alleviates, to a certain extent, the spike response problem of CTC and the contribution allocation problem in the end-to-end model, and strengthens the training of the feature extraction module. Multi-state joint optimization also greatly reduces the convergence time of our model, while enabling it to achieve competitive results on the Dev set and Test set of the RWTH dataset.
