Abstract
Sign language is the primary means of communication between hard-of-hearing and hearing people. Sign language recognition helps deaf and hard-of-hearing people integrate better into society. We reviewed 95 studies on sign language recognition technology published from 1993 to 2021. We analyze and compare algorithms from three aspects (gesture, isolated word, and continuous sentence recognition), describe the evolution of sign language acquisition equipment, and summarize the datasets and evaluation criteria used in sign language recognition research. Finally, the main technology trends are discussed, and future challenges are analyzed.
Introduction
Sign language is a visual gesture language of daily communication between hard-of-hearing and deaf people. It uses different body parts, such as fingers, hands, arms, head, body, and facial expressions, to convey information [1]. Every country and every region has its distinct sign language, which makes it impossible for people with hearing disabilities from different countries or regions to communicate.
China has the largest number of hard-of-hearing persons in the world: the deaf and hard-of-hearing make up an estimated population of 27.8 million in China alone, with one deaf person in every 100 people. Most people communicate in spoken language in daily life, and almost none of them understand or use sign language. Communication between the hearing and the deaf community is therefore very difficult, with the result that deaf and hard-of-hearing people are isolated and find it difficult to integrate into society.
Sign language recognition involves research in various disciplines, such as spatial geometry, pattern recognition, probabilistic statistics, artificial intelligence, natural language processing, and image analysis and processing [1]. These subjects and technologies are applied in the research of sign language recognition. At the same time, the research results promote the development of the above fields. In sign language recognition, the results help improve the living, learning, and working environment of the deaf-mute. It can popularize standardized sign language better and enrich the teaching mode if applied in deaf-mute education. In addition, researchers have applied sign language recognition technology to other related fields, such as traffic police gesture recognition, intelligent home appliance control, film production, virtual games, etc. This research brings benefits to a wide range of social settings.
With the development of computer technology and research on intelligent human-machine interfaces, sign language recognition has received increasing attention. This article summarizes and discusses the current status and development trend of sign language recognition research in recent years. The main contributions are as follows: (1) a review of the current status of sign language recognition research in recent years, as a reference for future research; (2) a summary of the datasets in the field of sign language recognition, so that researchers can easily access this information; and (3) some innovative ideas for the future research and development of sign language recognition.
The article is organized as follows: Section 2 introduces sign language acquisition equipment. Section 3 summarizes the development of sign language recognition research technology from the three aspects of gestures, isolated words, and continuous sentences. Section 4 presents the sign language data sets and evaluation criteria in China and abroad. Section 5 gives the conclusions and future work.
Data acquisition equipment
The process of sign language recognition consists of several general stages, namely data acquisition, pre-processing, segmentation, feature extraction, and classification. Data acquisition is the first step and one of the most important, because its quality affects the overall result of sign language recognition. Throughout the history of sign language recognition research, data acquisition equipment can be divided into three categories: sensor-based devices such as data gloves, Kinect sensors, and cameras.
Sensor-based recognition utilizes a data glove and a position tracker to measure the angular information of each joint of the hand, and uses the trajectory and spatiotemporal information of hand movements for sign language recognition. First, the data glove reports data from each joint; the position tracker then returns the hand's 3D coordinates, which give the gesture's position in 3D space. The research on sensor-based sign language recognition started in 1982, and in 1983, Grimes [2] of the American Telephone and Telegraph Company obtained the patent for the Digital Data Entry Glove Interface Device. He is considered the first person to research sign language recognition. This device can recognize 72 one-handed letters. Fels et al. [3] used VPL data gloves and a Polhemus position tracker as input devices to build a sign language recognition system. The system uses five neural networks to extract features from movement trajectory, movement direction, offset, movement speed, and hand shape for recognition. Yu et al. [4] proposed a Chinese sign language recognition method based on Deep Belief Net, using two 6-D inertial sensors and eight sEMG sensors as input devices to collect datasets. The sensors were placed on both sides of the participants. On each side, one 6-D inertial sensor was placed on the back of the forearm, and four sEMG sensors were placed on the four target forearm muscles: the little finger extensor, palmaris longus, extensor carpi ulnaris, and extensor carpi radialis. The data collection was conducted in five different sessions, and a total of 3750 CSL words were collected. The work in [5] uses five flex sensors, an Inertial Measurement Unit (IMU), and three Force Sensing Resistors (FSR) to detect the degree of finger bending and hand movement information. The detected data are transmitted to the computer by an Arduino Micro.
The data glove can directly obtain the coordinates of a human hand in 3D space, the degree of bending of the fingers, and the direction and speed of movement. Its disadvantage is that it is very inconvenient to wear and use; thus, it is still in the laboratory research stage.
Due to the high price of data gloves, researchers have been looking for more economical equipment. Microsoft launched a somatosensory device called Microsoft Kinect on June 14, 2010, primarily for video game players. However, researchers have widely used it as research equipment due to its ability to perceive the sound, gesture, and skeletal movement information and provide color and depth video streams. In the work using Kinect for sign language recognition [6], Chai et al. [7] placed the device in front of the body to obtain RGB and depth information and then fed it into the neural network for recognition. Feng et al. [8] conducted a palm segmentation study using the color and depth information collected by the Kinect sensor. Bencherif et al. [9] developed a novel Arabic Sign Language (ArSL) recognition system, which uses two Kinect cameras to collect sign language videos from different angles and input the hand and body key points selected from the sign language video frame into the parallel 1D-CNN for recognition. Although the Kinect device can provide more information, it is not convenient to carry. Microsoft decided to discontinue the Kinect device in 2014, so it has gradually faded from the field of sign language recognition research.
Recent research demonstrated that the addition of depth images does not bring about improvement in sign language recognition; by comparison, RGB images carry more useful information. Therefore, scholars gradually abandoned the Kinect and turned to cameras and camcorders, which are simple and can easily capture RGB images. The input data for sign language recognition takes the form of gesture images, which a camera captures easily. Some researchers use image acquisition equipment such as standard cameras, webcams, stereo cameras, and thermal imagers to avoid the problems of data gloves and somatosensory equipment. Reference [10] uses a camera to capture the image and then performs image analysis and gesture recognition. Capturing information with a single camera is easy and fast but covers only one viewpoint, so many researchers use multiple cameras to capture multi-directional information. In 2007, Ten et al. [11] collected 121 Dutch sign language words with cameras. They extracted 6-dimensional features for each word and presented a Dynamic Time Warping (DTW) algorithm for multi-dimensional time series (MD-DTW). Similarly, Kishore et al. [12] used four cameras to obtain multi-azimuth image information at positions of –20°, –10°, 10°, and 20° from the center. In 2021, researchers [13] used a high-definition web camera to collect gestures at different positions, angles, and directions that volunteers made freehand in front of the camera. Park et al. [14] proposed a system based on depth video for translating sign language into text using the front camera of a smartphone. Most sign language recognition research is camera-based because of the low cost of acquiring data, clear images, and portability.
The analysis table of data acquisition equipment is shown in Table 1, summarizing the changes of sign language data collection equipment in the past 20 years. It can be seen that although it is very inconvenient to use data gloves and Kinect equipment to collect sign language, there are still researchers who continue to use them. This is because the data glove and the Kinect device can sample multi-modal information, and the complementarity between multi-modal information in sign language recognition has a certain effect on the recognition of sign language. However, from the perspective of the development and application of sign language recognition in the future, cameras are more suitable and convenient for communication and the use of deaf people in daily social life.
Data acquisition equipment analysis table
Sign language recognition can be divided into gesture recognition, isolated word recognition, and continuous sentence recognition. The following sections explain these three aspects.
Gesture recognition
The object of gesture recognition is the shape of the hand in a single picture. Compared with isolated words, the semantics of gestures are simple and clear. Recognition mainly focuses on accurately recognizing the meaning of gestures and improving the recognition rate. Different gesture models determine the diversity of gesture recognition methods. The research methods mainly include three main frameworks: Dynamic Time Warping (DTW), Hidden Markov Model (HMM), and Convolutional Neural Networks (CNN) based methods. The specific classification is shown in Fig. 1.

Gesture recognition classification.
The Dynamic Time Warping (DTW) algorithm is an early, classical algorithm based on dynamic programming that performs time normalization and pattern matching for sequences that vary non-linearly in time. It uses a non-linear warping function to eliminate non-linear fluctuations in time, thereby removing the differences between spatiotemporal representations on different time axes.
Assume that $a_1, a_2, \ldots, a_m, \ldots, a_M$ are the feature vectors of the reference template and $b_1, b_2, \ldots, b_n, \ldots, b_N$ (with $M \neq N$) are the feature vectors of the input sample. The DTW algorithm finds a non-linear mapping relationship, called the time normalization function:
$$m = w(n) \quad (1)$$
This function expresses the mapping between the time axis $n$ of the input sample and the time axis $m$ of the reference sample, and satisfies the following condition:
$$D = \min_{w(n)} \sum_{n=1}^{N} d[n, w(n)] \quad (2)$$
where $d[n, w(n)]$ is the distance between the feature vector of the $n$-th frame of the input sequence and the feature vector of the $w(n)$-th frame of the reference sequence, and $D$ is the total distance between the input sample and the reference sample under the optimal time normalization.
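The dynamic-programming search for the optimal warping can be sketched as follows. This is a minimal illustration rather than the implementation used in any of the cited works: the function names `euclidean` and `dtw_distance`, the Euclidean frame distance, and the symmetric step pattern (match, insertion, deletion) are all assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Frame-level distance d between two feature vectors (assumed Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(reference, query):
    """Minimum accumulated distance between a reference template and an input sequence.

    cost[m][n] holds the best accumulated distance for aligning the first m
    reference frames with the first n input frames.
    """
    M, N = len(reference), len(query)
    cost = [[math.inf] * (N + 1) for _ in range(M + 1)]
    cost[0][0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            d = euclidean(reference[m - 1], query[n - 1])
            # Assumed step pattern: diagonal match, or stretch either sequence.
            cost[m][n] = d + min(cost[m - 1][n - 1], cost[m - 1][n], cost[m][n - 1])
    return cost[M][N]


if __name__ == "__main__":
    template = [(0.0,), (1.0,), (2.0,)]
    slow_copy = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]  # same gesture, slower
    print(dtw_distance(template, slow_copy))
```

A slowed-down copy of the template aligns at zero cost, which is the time-normalization effect described above. The table has M x N cells, which is the computational burden that later DTW variants try to reduce.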
In gesture research, Konecny et al. [15] added histogram of oriented gradients (HOG) descriptors of the hand and histogram of optical flow (HOF) descriptors to DTW to improve the accuracy of gesture recognition. Because DTW must perform matching calculations over all paths and nodes, its computational cost is high. Huang et al. [16] proposed a multi-dimensional dynamic time warping (MD-DTW) method for gesture recognition. It imposes boundary constraints on the globally planned path, which reduces the number of computations required for template matching, saves unnecessary warping time, and improves the algorithm's robustness. However, because the warping path is bounded, features may be lost and matching may produce wrong predictions. To express gesture features more comprehensively, Zhou et al. [17] proposed an algorithm based on Global Template Dynamic Time Warping (GTDTW). The global template obtained by this method can fully express the characteristics of gestures. At the same time, it reduces time consumption and ensures real-time gesture recognition. The global template is also shorter than the conventional template of dynamic time warping.
The DTW algorithm improves the rate of gesture recognition, but it does not perform well on large amounts of data, complex gestures, or combined gestures. Therefore, the Hidden Markov Model (HMM), a statistical pattern recognition method, was adopted. HMM is a commonly used state-space-based method and a mature mathematical method for matching time-varying data. Generally, an HMM is expressed as a triplet μ = (A, B, π), where π is the probability distribution of the initial state, A is the state transition probability matrix, and B is the symbol emission probability matrix. Because the state sequence in an HMM is hidden, the parameters are estimated with the expectation-maximization (EM) method, which computes maximum-likelihood estimates of the parameters of a statistical model with hidden variables. The calculation steps are as follows:
E-step: calculate the expected values $\varepsilon_t(i, j)$ and $\gamma_t(i)$ from the model $\mu_i$ according to Eqs. (3) and (4):
M-step: using the expected values obtained in the E-step, re-estimate the parameters $\pi_i$, $a_{ij}$, and $b_j(k)$ according to Eqs. (5), (6), and (7) to obtain $\mu_{i+1}$:
Loop: let $i = i + 1$ and repeat the EM calculation until $\pi_i$, $a_{ij}$, and $b_j(k)$ converge.
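One EM iteration for a discrete HMM can be sketched with the standard Baum-Welch forward-backward recursions. This is an illustrative sketch under several assumptions: a single observation sequence of symbol indices, no numerical scaling (so it is only suitable for short sequences), and hypothetical function names. In the code, `xi` plays the role of the pairwise expectation written $\varepsilon_t(i, j)$ above and `gamma` the role of $\gamma_t(i)$.

```python
def forward(A, B, pi, obs):
    """alpha[t][i]: probability of obs[0..t] with state i at time t."""
    T, S = len(obs), len(pi)
    alpha = [[0.0] * S for _ in range(T)]
    for i in range(S):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(S))
    return alpha


def backward(A, B, obs):
    """beta[t][i]: probability of obs[t+1..] given state i at time t."""
    T, S = len(obs), len(A)
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(S))
    return beta


def baum_welch_step(A, B, pi, obs):
    """Run one E-step and one M-step; return updated (A, B, pi) and the
    likelihood of obs under the parameters that were passed in."""
    T, S, K = len(obs), len(pi), len(B[0])
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    likelihood = sum(alpha[T - 1])
    # E-step: state occupancy (gamma) and transition (xi) expectations.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(S)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(S)] for i in range(S)] for t in range(T - 1)]
    # M-step: re-estimate pi, a_ij and b_j(k) from the expectations.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) / sum(gamma[t][i] for t in range(T - 1))
              for j in range(S)] for i in range(S)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) / sum(gamma[t][j] for t in range(T))
              for k in range(K)] for j in range(S)]
    return new_A, new_B, new_pi, likelihood


if __name__ == "__main__":
    A = [[0.7, 0.3], [0.4, 0.6]]
    B = [[0.5, 0.5], [0.1, 0.9]]
    pi = [0.6, 0.4]
    obs = [0, 1, 1, 0, 1]  # toy sequence of quantized gesture symbols
    for step in range(3):
        A, B, pi, likelihood = baum_welch_step(A, B, pi, obs)
        print(step, likelihood)
```

Each call never decreases the likelihood of the observation sequence, which is the convergence behaviour the loop above relies on. Practical systems usually work in log space or rescale `alpha` and `beta` to avoid underflow on long sequences.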
HMM was mainly used in the field of speech recognition in the past. In gesture recognition, it has developed rapidly in recent years, and many representative research methods have emerged. Yan et al. [18] adopted the SWAB algorithm for automatic endpoint detection, HMM for modeling gesture instructions, and the K-means algorithm for vector quantization of gesture feature sequences. Although non-verbal communication mainly uses hand and finger movements, it is not limited to these. In [19], the author extends the gesture-based framework to full-body posture and feeds the HMM with an innovative feature descriptor based on active differential features, in place of traditional geometric features, to obtain a satisfactory recognition rate. With in-depth research, it was found that the basic HMM cannot learn richer features and that its generalization ability is limited. Yang et al. [20] proposed a frag-HMM method, which obtains the similarity of gestures from a typical and independent video stream without accurate segmentation. With the rapid development of machine learning, scholars began to combine traditional HMM with machine learning models to obtain greater accuracy. In [21], Koller et al. combined the CNN classifier trained by GoogLeNet's Inception-VL model with HMM. In another work, [22] proposed a combination of HMM and FNN for dynamic gesture recognition. It decomposed the gesture image into three feature sequences: hand shape change, hand position change on the two-dimensional plane, and movement in the Z-axis direction. An HMM was then established for each of them, and fuzzy inference was used to connect the FNN to determine the semantics of the gesture. The model can quickly and effectively recognize complex dynamic gestures. In addition, Deep Belief Networks have been applied in gesture recognition. Chen et al. [23] presented a model that combines HMM with Deep Belief Networks. Meanwhile, a time-level dictionary was embedded into the hidden Markov model, and relative entropy was introduced to measure the dictionary's information richness; the validity of the proposed method was verified on two benchmark gesture data sets.
HMM needs a large number of training samples to obtain statistical characteristics. With more training samples, the performance of HMM is better, but the amount of calculation is huge, which affects the real-time performance of the system. With the rapid development of deep learning, neural networks have shown outstanding learning capabilities in various fields. Convolutional Neural Networks (CNN) are usually composed of convolutional layers, activation layers, pooling layers, and fully connected layers. The network structure is expressed as Input -> [Conv -> ReLU]*N -> [Pooling]*M -> [FC -> ReLU]*K -> FC; the formula is as follows:
where Conv is the convolutional layer, ReLU is the rectified linear unit activation layer (due to space issues, ReLU is omitted in formula (8)), Pooling is the pooling layer, FC is the fully connected layer, and * denotes the number of repetitions, with N ≥ 0, M ≥ 0, and K ≥ 0 (usually K < 3). The specific values differ between research methods.
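The repetition notation can be made concrete by expanding it into an explicit layer list. The helper below is only a reading aid for the pattern Input -> [Conv -> ReLU]*N -> [Pooling]*M -> [FC -> ReLU]*K -> FC; the function name `build_layer_sequence` is hypothetical, and the layers are plain labels rather than trainable modules.

```python
def build_layer_sequence(n_conv, n_pool, n_fc):
    """Expand the pattern into an ordered list of layer names.

    n_conv, n_pool and n_fc correspond to N, M and K in the text.
    """
    if min(n_conv, n_pool, n_fc) < 0:
        raise ValueError("repetition counts must be non-negative")
    layers = ["Input"]
    layers += ["Conv", "ReLU"] * n_conv   # [Conv -> ReLU] * N
    layers += ["Pooling"] * n_pool        # [Pooling] * M
    layers += ["FC", "ReLU"] * n_fc       # [FC -> ReLU] * K
    layers.append("FC")                   # final classification layer
    return layers


if __name__ == "__main__":
    print(" -> ".join(build_layer_sequence(n_conv=2, n_pool=1, n_fc=1)))
```

With N = 2, M = 1, and K = 1 this yields Input -> Conv -> ReLU -> Conv -> ReLU -> Pooling -> FC -> ReLU -> FC.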
After neural networks were introduced into the field of gesture recognition, CNN-based methods showed state-of-the-art performance in human gesture and action recognition. Kang et al. [24] used CNN to extract features for sign language recognition from depth images. Karpathy et al. [25] extended CNN connectivity in the temporal domain and designed a multi-resolution centralized architecture using local spatiotemporal information to expedite the training. Liang et al. [26] projected the point cloud of the hand onto different view planes, used CNN to extract the features in each view, and used SVM for gesture recognition training. The experiment showed that the method is more robust but generalizes poorly. Although CNN has strong feature extraction capabilities, a 2D CNN is limited to extracting spatial information and cannot extract temporal information. Compared with CNN, 3D-CNN further extends the convolution kernel to the time domain, and the spatiotemporal characteristics of its convolution kernel overcome this disadvantage of CNN. The network based on C3D [27] was successfully used in the 2016 ChaLearn LAP Large-scale Isolated Gesture Recognition Challenge [28–30], showing state-of-the-art performance, and it has since been widely used by scholars in gesture research. In [31], the authors used the 3D-CNN model to learn depth and intensity data separately and combined the information of multiple spatial scales for gesture prediction. Similarly, Li et al. [29] proposed a 3D-CNN-based network architecture that takes both RGB and depth image data as input and achieved good results. The work in [32] proposed a ResC3D network model based on multi-modal data for gesture recognition. Molchanov et al. [33] combined 3D-CNN and RNN to detect and classify dynamic gestures from multi-modal data, which enhanced the generalization of the network model. Zhu et al. [34] used 3D-CNN and convolutional LSTM networks to learn spatiotemporal features end-to-end and used multi-modal data for fine-tuning to further optimize the network model. Researchers in [35] used two streams of 3D-CNN to learn the fine-grained features of the hand shape and the coarse-grained features of the global body configuration. The experimental results show that the proposed system outperforms state-of-the-art approaches, demonstrating its effectiveness. Yu et al. [36] utilized shallow two-stream CNNs to capture the low-level features of the original video frames and the corresponding optical flow. In addition, they utilized an attentive feature fusion module to selectively combine useful information from the two streams based on the attention mechanism. This method achieves an accuracy of 95.77% on the Jester dataset.
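The difference between a per-frame 2D convolution and a 3D convolution is that the 3D kernel also spans several consecutive frames. The following single-channel sketch makes this explicit; it is not C3D or any cited architecture. The function name `conv3d_valid`, the nested-list video layout (time, height, width), and the temporal-difference kernel are assumptions chosen for illustration.

```python
def conv3d_valid(video, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation over (time, height, width)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                s = 0.0
                # The kernel covers kt consecutive frames, so the response
                # depends on how pixel values change over time.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            s += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


if __name__ == "__main__":
    # Three 2x2 frames whose brightness changes over time.
    video = [[[0.0, 0.0], [0.0, 0.0]],
             [[1.0, 1.0], [1.0, 1.0]],
             [[3.0, 3.0], [3.0, 3.0]]]
    # A 2x1x1 kernel that subtracts the previous frame from the current one.
    temporal_diff = [[[-1.0]], [[1.0]]]
    print(conv3d_valid(video, temporal_diff))
```

With this kernel the output is the frame-to-frame change (1.0 everywhere, then 2.0 everywhere), whereas a static video would produce zeros. A kernel with no temporal extent cannot respond to such motion.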
Table 2 lists the main technical frameworks of gesture recognition and their representative work. Representative works that use the same framework are not compared with each other because they use different data sets; we therefore list them in chronological order. At present, gesture recognition is evaluated by accuracy.
Gesture recognition analysis table
Unlike gestures, isolated words are spatiotemporal sequences: more structured gestures with precise meanings defined in sign language books. Traditional sign language recognition methods have inadequate learning ability and insufficient adaptability. In contrast, deep-learning methods for isolated word recognition have strong fault tolerance and adaptability. Therefore, this section describes them in detail based on the frameworks of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The specific classification is shown in Fig. 2.

Isolated word recognition classification.
CNN is a locally connected feedforward neural network with three basic characteristics: local receptive fields, weight sharing, and downsampling. It has been widely used in speech recognition, license plate recognition, face recognition, and other fields. Its high recognition accuracy and speed can also benefit sign language recognition for isolated words. Research [37] and [38] both use CNN to obtain image features: the former uses multi-scale image features from all levels, while the latter focuses more on hand changes. Besides hand features, information such as motion features and facial features can also play an important role in isolated word recognition. Kopuklu et al. [39] proposed a data-level fusion strategy at the CVPR2018 conference that fuses motion information into static images, sent the fused spatiotemporal features to a CNN for classification, and achieved good recognition results.
CNN has achieved great results in isolated sign language recognition, but it focuses only on each frame’s features without considering the inter-frame motion information. Scholars have introduced the 3D-CNN network to solve this problem, capturing the video’s temporal and spatial feature information. The breakthroughs of sign language recognition based on 3D-CNN are different fusion models and spatiotemporal attention mechanisms. Huang et al. [40] proposed a novel 3D-CNN model for isolated word recognition. To boost performance, multi-channel video streams, including color information, depth clue, and the body’s joint positions, are used as input to the 3D-CNN to integrate color, depth, and trajectory information. In a similar work, Liang et al. [41] proposed a 3D-CNN network based on multi-modal data input and convolution fusion for various data, which verified its effectiveness on large-scale data sets. However, its generalization is not high on sign language data sets composed mostly of RGB data. In another work, Wu et al. [42] proposed a 3D-CNN based on multi-channel data fusion for isolated word recognition. By incorporating RGB and depth information stacking into the 3D-CNN, Wu introduced the HMM model for short-time series modeling, verifying this method’s effectiveness on the ChaLearn data set. Adding the spatiotemporal attention mechanism into 3D-CNN can capture the most critical sign language movements in the sign language video and obtain superior recognition accuracy.
In 2018, Huang et al. [43] proposed an attention-based 3D-CNN model, which benefits from the ability of 3D-CNN to learn spatiotemporal characteristics and simulates the human visual mechanism by finding and attending to regions of interest. The results show that this method's recognition rate is higher than that of the C3D-based method in [40], which indicates the effectiveness of spatial and temporal attention mechanisms. Paper [44] further explored the ability of cascaded 3D-CNN to capture sign language information. With the continuous evolution of 3D-CNN and the proposal of C3D-Ret, sign language recognition has been further improved. In ordinary networks, vanishing or exploding gradients occur as the depth of the network increases, but the residual network uses skip connections in its residual blocks, which alleviates the vanishing gradients caused by increasing depth in deep neural networks. Miao et al. [45] proposed a multi-modal gesture recognition method based on the ResC3D network at ICCV2017. The key idea of this method is to find a compact and effective video sequence representation; it also proposes a feature fusion method based on canonical correlation analysis, and it won first place on the ChaLearn dataset. Liao et al. [46] combined ResC3D and bidirectional LSTM networks for sign language recognition. The method used Faster R-CNN to locate and identify key points of the hand and obtained state-of-the-art recognition accuracy. However, because it attends only to the characteristic information of the hand, it neglects other useful information. Hand joints and whole-body skeleton data play an auxiliary role in the extraction of sign language video features. Razieh et al. [47] used the estimated 3D key points and midpoints of the RGB image to construct a hand skeleton and obtain different representations from key points of the hand. The skeleton was then projected onto three surfaces and converted into an image format, and finally fed to a 3D-CNN to obtain different local spatiotemporal complementary feature representations. Jiang et al. [48] proposed a perceptual multi-modal sign language recognition framework that extracts whole-body skeleton data, optical flow data, and depth information from RGB sign language video sequences. The fusion of multi-modal information enables the framework to learn complementary global information. Meng et al. [49] extracted skeleton information from RGB images, used graph convolutional networks (GCN) for multi-scale feature extraction, and proposed a keyframe extraction algorithm that improves efficiency at the expense of some accuracy.
Recurrent neural network
The above CNN structures have achieved remarkable results, but CNN can only extract short-term spatiotemporal features due to its constraints. To improve isolated word recognition accuracy, researchers began to use RNN to extract long-term spatiotemporal features. The RNN calculation is divided into two steps: the first step calculates the hidden state $h_t$ at the $t$-th time step, and the second step calculates the predicted value $y_t$ at the $t$-th time step:
where $W_x$, $W_h$, and $W_y$ are weights, and $b$ is the bias value.
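The two steps can be sketched as a plain recurrent forward pass. The sketch assumes the common Elman form, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h)$ and $y_t = W_y h_t + b_y$. The tanh activation, the linear output, the separate biases `b_h` and `b_y`, and the function names are assumptions made for illustration and may differ from the formulation in the cited works.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, W_x, W_h, W_y, b_h, b_y):
    """Run a simple RNN over a sequence; return all outputs and the last hidden state."""
    h = [0.0] * len(W_h)  # initial hidden state h_0
    ys = []
    for x in xs:
        # Step 1: hidden state from the current input and the previous hidden state.
        pre = [a + c + b for a, c, b in zip(matvec(W_x, x), matvec(W_h, h), b_h)]
        h = [math.tanh(p) for p in pre]
        # Step 2: prediction from the hidden state (linear output assumed).
        ys.append([o + b for o, b in zip(matvec(W_y, h), b_y)])
    return ys, h


if __name__ == "__main__":
    xs = [[1.0], [0.0]]  # two time steps, one feature each
    ys, h = rnn_forward(xs, W_x=[[1.0]], W_h=[[1.0]], W_y=[[1.0]], b_h=[0.0], b_y=[0.0])
    print(ys, h)
```

Because $h_t$ depends on $h_{t-1}$, the second output in the example is non-zero even though the second input is zero. This is how information from earlier frames is carried forward.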
Representative RNN-based research includes the following. Chai et al. [50] proposed a two-stream RNN network for isolated word recognition. One input of the network is RGB data; the other is a feature fusion of extracted skeleton data and a gradient histogram. Finally, they fused the two streams to obtain the final score. This method ranked first in the ChaLearn Gesture Recognition Challenge. In isolated word recognition, preserving spatially related information yields more meaningful spatiotemporal features. Therefore, many studies have emerged that combine CNN with RNN. Ye et al. [51] proposed a 3D-RCNN network model that combines 3D-CNN and RNN to capture the temporal and spatial information of sign language. The 3D-CNN learned multi-modal features from RGB, motion, and depth channels, and the FC-RNN captured the temporal information between the short video segments divided from the original video. The 3D-RCNN can recognize and locate the semantic information in videos of different lengths, improving recognition accuracy. In another work, to remove redundant information that affects recognition accuracy, Huang et al. [52] proposed a keyframe-based sequence-to-sequence learning method for sign language recognition. The keyframes of the multi-modal data stream were embedded in CNN, DMM (in essence, RNN), and the network used to extract trajectory information, allowing different attention to be paid to the input data. The experiment obtained a remarkable result on a dataset of 310 Chinese sign language words. Moreover, Lin et al. [53] combined the Res-C3D network with a mask and an LSTM network for skeleton data modeling, and combined a segmentation algorithm with a classification network to study isolated words. This method improves the recognition accuracy, but the early masking process is time-consuming. Ameur et al. [54] proposed a Hybrid Bidirectional Unidirectional LSTM (HBU-LSTM) framework that combines the LSTM network and the BLSTM network. The model is capable of efficiently classifying the input data, and the experiment obtained a remarkable result on the LeapGestureDB dataset. Santos et al. [55] proposed an approach called star RGB representation, which condenses a video clip containing a dynamic gesture into one RGB image, and then used a ResNet convolutional neural network for feature learning. The method achieves an accuracy of 98% on the GRIT dataset, which demonstrates its effectiveness. Literature [56] developed an SLR method based on a boundary-adaptive encoder combined with window attention. The key idea is to use a boundary detection unit (BDU) to learn the time boundaries of the input symbol sequence and to embed the BDU into the BLSTM, so that the length of the encoding block can be adaptively controlled according to the input information and the hidden state of the encoder. In 2021, Rodri et al. [57] proposed using the Common Spatial Patterns (CSP) algorithm for feature extraction with multiple classifiers for classification, reaching 97.95% on Argentinian Sign Language. The representative work analysis table of isolated word recognition technology is shown in Table 3.
Isolated word recognition analysis table
Table 3 lists the main technical frameworks of isolated word recognition and their representative work. Representative works that use the same framework are not compared with each other because they use different data sets; we therefore list them in chronological order. As can be seen from Table 3, 3D-CNN and RNN network architectures have been more accurate in isolated word recognition research in recent years. Combining these two technologies retains the spatial features extracted by CNN, learns long-term temporal information through the RNN, and finally obtains more meaningful spatiotemporal features, which greatly improves the recognition performance for gestures and isolated words. At present, both gesture recognition and isolated word recognition are evaluated by accuracy.
Unlike isolated word recognition, continuous sign language recognition (CSLR) must handle transitional frames (a transition frame is a frame that connects adjacent sign language words but has no actual meaning). The purpose of CSLR is to identify the annotations in a sign language sequence. It differs from isolated word recognition, in which each sign is independently segmented and annotated. It also differs from sign language translation (SLT) [58], which involves an extra step: translating the recognized annotations into a grammatical sentence. For the trained model, the core problem of sign language recognition, speech recognition, and machine translation is transforming (input) sequences from one domain into (output) sequences in another. These tasks can be abstracted as variable-length sequence learning problems. However, because the sequence length is not fixed and the input and output sequence lengths do not correspond, traditional neural network models (DNN, CNN, and RNN) cannot directly model and learn such problems in an end-to-end manner. RNN can nevertheless model long time series well; common variants are LSTM, BLSTM, and GRU.
The main continuous sign language recognition methods include Hidden Markov Model (HMM) recognition, Connectionist Temporal Classification (CTC) recognition, Seq2Seq (encoder-decoder) recognition, and other recent methods. The specific classification is shown in Fig. 3.

Continuous sign language recognition classification.
HMM is also widely used in continuous sign language recognition. In related work, Koller et al. of RWTH Aachen University conducted continuous sign language recognition research on German sign language datasets from 2009 to 2012. In 2016 [59], they proposed a hybrid continuous sign language recognition model based on CNN and HMM. The model combines deep learning with traditional methods and achieved good recognition results on two publicly available large-scale benchmark sign language datasets. It is a milestone study in continuous sign language recognition research. In 2016, Koller et al. [60] embedded BLSTM into the original framework and proposed an iterative realignment algorithm that embeds the CNN-BLSTM network into the HMM. The algorithm uses multiple realignments to improve the recognition performance of the model. Since the HMM method is a mathematical method based on matching time-varying data, it achieves good results only with large samples. However, as the number of samples increases, the amount of calculation increases geometrically, which seriously affects real-time performance. As a result, HMM-based methods have gradually been abandoned in continuous sign language recognition in recent years.
Connectionist Temporal Classification (CTC)
CTC avoids the need to align input and output in advance: it allows an RNN to learn directly from sequence data without the mapping between the input sequence and the output sequence being marked in the training data. The training process of a CTC-based network framework is typically as shown in Fig. 4.

CTC-based network framework.
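The CTC loss is calculated as follows. Let $x = (x_1, \dots, x_T)$ be the input sequence of $T$ frames, $y$ the target label sequence, and $\pi = (\pi_1, \dots, \pi_T)$ a frame-level alignment over the labels plus a blank symbol. In the standard CTC formulation, the probability of an alignment and of the target sequence are

\[
p(\pi \mid x) = \prod_{t=1}^{T} p(\pi_t \mid x), \qquad
p(y \mid x) = \sum_{\pi \in B^{-1}(y)} p(\pi \mid x),
\]

and the loss is

\[
L_{CTC} = -\ln p(y \mid x),
\]

where $B$ is the mapping that merges repeated labels and then removes blanks, and $B^{-1}(y)$ is the set of all alignments that $B$ maps to $y$.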
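To make the sum over alignments concrete, the following is a minimal sketch in plain Python, not the implementation used by any of the cited works. It assumes that label index 0 is the blank and that `probs[t][k]` holds the per-frame probability of label `k`; the function names are illustrative. It computes the probability of a target sequence twice: by enumerating every alignment in B^{-1}(y), and by the usual forward recursion, which gives the same value without enumeration.

```python
import math
from itertools import product

BLANK = 0  # index of the blank label (an assumption of this sketch)


def collapse(path):
    """The mapping B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def ctc_prob_bruteforce(probs, target):
    """Sum p(path | x) over every path that B maps to the target."""
    num_frames, num_labels = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(num_labels), repeat=num_frames):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


def ctc_prob_forward(probs, target):
    """Same quantity via the forward recursion over the blank-extended target."""
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(probs, target):
    """Negative log-likelihood of the target under the CTC model."""
    return -math.log(ctc_prob_forward(probs, target))
```

The brute-force version grows exponentially with the number of frames and is only useful for checking the recursion on toy inputs; practical systems use the recursion in log space as provided by deep learning frameworks.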
Training a network with CTC enables end-to-end continuous sign language recognition. Research methods can generally be divided into those that use a single CTC loss and those that use multiple losses. Among recent work using a single CTC loss, Pu et al. [61] from the University of Science and Technology of China proposed a continuous sign language recognition framework based on a 3D-ResNet and dilated convolutional network in 2018. They used an optimization strategy based on the CTC algorithm to fine-tune the feature extractor with pseudo-labels and verified its effectiveness and superiority on the German benchmark dataset. In 2018, Guo et al. [62] proposed a network structure (DenseTCN) based on 3D-CNN and stacked temporal convolution. They used CTC to learn feature classification and generate translated sentences. These studies take single-modal video frames as input, which is simple and fast but carries less information than multi-modal data, so a number of studies using multi-modal input have emerged. Camgoz et al. [63] proposed the SubUNets framework at ICCV 2017 and used CTC to solve the alignment and recognition problems. This method combines the modal information of both the full-frame image and the hand image, so that key gesture information receives attention alongside the full-frame information. In another work, Cui et al. [64] proposed a continuous sign language recognition system with a recurrent convolutional neural network that takes optical flow images and RGB frames as multi-modal input. They used CTC’s alignment proposals as weak supervision to fine-tune the feature extraction module and used an iterative strategy to improve performance. They verified the algorithm on two publicly available benchmark datasets.
The CTC loss can be used to adjust the network, but training with a single CTC loss is prone to overfitting. To solve this problem, scholars use multiple losses to jointly adjust the network. In related work, Cui et al. [65] designed a three-stage optimization process to train the network: CTC loss for end-to-end training, KL-div loss for feature learning, and sequence learning with the improved features. They demonstrated the effectiveness of this optimization strategy on a challenging benchmark. In another work, Wang et al. [66] proposed a deep hybrid architecture that consists of a temporal convolution module (TCOV), a bidirectional gated recurrent unit module (BGRU), and a fusion layer module (FL) to address the CSLR problem. They proposed a joint CTC loss optimization and a decoding fusion strategy based on deep classification scores to boost performance. This method set a new benchmark on both the RWTH-PHOENIX-Weather dataset and the CSL dataset, but the network architecture is relatively complex and places high demands on hardware. Furthermore, in a similar work, the authors of [67] proposed a parallel time encoder (PTEnc) network to simultaneously learn the local and global temporal relationships of a video. They designed a reconstruction loss to measure the distance between the original visual features and the reconstructed features and combined it with the CTC loss to realize end-to-end optimization. Similarly, Yang et al. [68] proposed a structured feature network (SF-Net). The proposed model extracts features in a structured manner and gradually encodes information at the frame level, the gloss level, and the sentence level into the feature representation. It uses a KL-div loss based on gloss-level labels and a CTC loss based on sentence-level labels to jointly optimize the network. Test results on two large-scale public SLR datasets showed that SF-Net outperforms previous methods based on sequence-level supervision in terms of accuracy and adaptability.
To address the insufficient training of feature extractors caused by the tendency of CTC to overfit, Min et al. [69] proposed a visual alignment constraint (VAC) that enhances the feature extractor through additional alignment supervision. VAC comprises two losses: one makes predictions from the visual features alone, and the other aligns the short-term visual features with the long-term contextual features. Its effectiveness was verified on two public datasets. Gao et al. [70] proposed an alignment network from the visual level to the vocabulary sequence and, for the first time in continuous sign language recognition, introduced an RNN translator to learn the best alignment between sign language video and sentence-level labels.
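The common pattern in the multi-loss methods above is a weighted sum of the CTC loss and an auxiliary term. The following is a minimal sketch of that pattern with a KL-divergence term, not the objective of any specific cited paper. The weight `lam`, the direction of the divergence, and the per-frame averaging are assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def joint_loss(ctc_loss_value, target_dists, predicted_dists, lam=1.0):
    """CTC loss plus lam times the mean per-frame KL divergence.

    target_dists and predicted_dists are lists of per-frame class
    distributions; lam is a hypothetical balancing weight.
    """
    kl = sum(kl_divergence(p, q) for p, q in zip(target_dists, predicted_dists))
    kl /= max(len(target_dists), 1)
    return ctc_loss_value + lam * kl
```

When the predicted frame-level distributions already match the targets, the auxiliary term vanishes and the objective reduces to the CTC loss alone.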
Because CTC assumes that the source sequence and the target sequence share the same order and that the elements of the target sequence are conditionally independent, the network cannot learn an implicit language model. Therefore, the seq2seq (encoder-decoder) model structure, which can attend to the source sequence and the target sequence simultaneously, has also been extensively developed. The main idea of the encoder-decoder network is to map between two sequences through an intermediate latent space. In other words, it encodes the source sequence into a vector of fixed size and then decodes the target sequence from it. However, encoding the source sequence as a fixed-size vector creates an information bottleneck, which makes it difficult to capture long-term dependencies between the source sequence and the target sequence. To solve this problem, Bahdanau et al. [71] proposed using an attention mechanism to pass additional information to the decoder so that it can find the most relevant source sequence information at every decoding step. Since then, scholars have widely used attention mechanisms in encoder-decoder networks. The most influential work is the framework combining a CNN with an attention-based encoder-decoder that Camgoz et al. presented at CVPR 2018 [58] (the framework diagram is shown in Fig. 5). It uses a CNN as the spatial embedding layer and a hybrid RNN+HMM as the tokenization layer and then uses the attention-based encoder-decoder network to train the framework in an end-to-end manner. This work achieved good results on public datasets. In addition, the code of the algorithm has been published, which provides a reference for new researchers in this area and further promotes the development of continuous sign language recognition research. Guo et al. conducted a series of continuous sign language recognition studies based on the seq2seq model. In 2018, Guo et al. [72] proposed a continuous sign language recognition framework based on asymmetric multi-layer LSTM. 
The attention-based weighting mechanism they proposed balances the semantic relationships in the feature learning process while taking redundant information into account. To address the problem that redundant information affects recognition accuracy, this research proposes a feature mining method and a pooling strategy for key blocks with edge length, which effectively improve the model’s learning efficiency for sign language change patterns and the accuracy of sign language translation. In the same year, at the AAAI conference, they proposed a hierarchical LSTM (HLSTM) encoder-decoder model that uses a time-weighted attention mechanism [73] and handles sign language recognition at different granularities through spatio-temporal conversion between frames, clips, and viseme units. The algorithm not only retains the original video features but also obtains high-level features. It achieved good recognition results on the Chinese Sign Language dataset, but the model is too complex, and the recognition speed of the multi-layer HLSTM is slow. To make full use of the complementarity between short-term and long-term spatio-temporal features, Guo et al. [74] proposed a temporal convolutional pyramid module that learns short-term temporal correlation from 2D-CNN features, complements the original 3D-CNN features, and embeds dynamic programming into the decoding scheme. To solve the problem of slow recognition speed, they proposed a hierarchical deep recursive fusion (HRF) framework in 2020 [75], which uses an encoder composed of adaptive clip summary (ACS) and LSTM to explore RGB view position and skeleton information, and uses LSTM and a query-adaptive decoding fusion to translate the target sentence. Pu et al. [76] proposed a network that uses CTC loss and cross-entropy loss to jointly train an encoder-decoder under the constraint of soft Dynamic Time Warping (soft-DTW). 
The algorithm uses strong supervision to achieve better performance on the German sign language dataset and the Chinese sign language dataset. Huang et al. [77] proposed a latent space hierarchical attention network (LS-HAN) for recognizing continuous sign language. It uses a hierarchical attention network (HAN), an extension of LSTM, to eliminate the pre-processing step of temporal segmentation and reduce the loss of information. However, because of the large number of frames, gradient explosion is likely to occur during LSTM calculation, and the calculation speed is slow. In response to this problem, Cheng et al. [78] performed end-to-end sign language recognition based on a fully convolutional network and introduced a Gloss Feature Enhancement (GFE) loss to provide additional corrective supervision. The effectiveness of the method was verified on the German sign language dataset.

NSLT Model [58].
In 2020, the authors of [79] proposed a novel encoder-decoder architecture based on the Transformer that jointly learns continuous sign language recognition and translation while being trainable in an end-to-end manner. This method uses a CTC loss to inject gloss-level supervision into the Transformer encoder, training it to perform sign language recognition while learning meaningful representations for sign language translation. They were the first to apply the Transformer structure from NLP to sign language recognition and achieved state-of-the-art results on the PHOENIX14T dataset. Subsequently, Necati et al. [80] built on this work by introducing gestures, mouth shapes, and skeleton information for multi-modal interactive learning and used a Transformer to simultaneously model intra-modal and inter-modal contextual relationships, which not only preserves the information specific to each modality but also learns the complementary information between the modalities.
Zhou et al. [81] proposed a self-attention-based network (SAFI) in 2020. It introduces fully inception modules with different receptive fields to extract dynamic segment-level features and combines an Aggregated Cross-Entropy (ACE) loss with the CTC loss to train global sequence feature learning and optimize the proposed model. Recently, a research paper [82] introduced a new method for context-aware continuous sign language recognition using a generative adversarial network architecture. This method includes a generator that recognizes sign language vocabulary by extracting spatial and temporal features from video sequences and a discriminator that evaluates the quality of the generator’s predictions by modeling text information at the sentence and vocabulary levels. By competing with each other, both the generator and the discriminator improve, resulting in more accurate and robust SLR results. Continuous sign language recognition technology and representative work are shown in Table 4.
Continuous sign language recognition technology and representative work analysis
From Table 4, it can be seen that applying the Transformer structure to sign language recognition and translation has achieved the best results. This is because the Transformer structure relies mainly on the self-attention mechanism instead of the previous RNN-based design, which avoids the vanishing and exploding gradient problems and removes the information bottleneck of the encoder-decoder structure. Therefore, future research on sign language recognition should continue to develop along the Transformer-based direction.
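The self-attention mechanism referred to above can be illustrated with a minimal sketch of scaled dot-product attention in plain Python. This is the generic operation, not the architecture of any cited model: the learned projection matrices, multiple heads, and positional encodings of a real Transformer are omitted, and the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Every output position is a weighted average of all value vectors, so
    each frame can draw on any other frame in one step, without passing
    information through a chain of recurrent states.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights
```

Because each output attends to every input position directly, the path between any two frames has constant length, which is the property the paragraph above contrasts with RNN-based encoders.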
Dataset
With the development of sign language recognition research, the demand for sign language datasets is also expanding, and sign language datasets of various sizes and characteristics have been released around the world one after another. This article summarizes the datasets that have been widely used in the field of sign language recognition over the years. The detailed classification of sign language datasets is shown in Table 5. In the research on sign language recognition over the years, most work is based on the SIGNUM dataset, the RWTH-Phoenix-Weather dataset, the CSL dataset, the Montalbano dataset, and the ChaLearn IsoGD dataset. In addition, CVPR 2021 organized the ChaLearn LAP Large Scale Signer Independent Isolated SLR Challenge using the AUTSL dataset, which attracted a large number of participants. Therefore, these six datasets are described in detail below. Figure 6 shows some samples from these datasets.
A summary of sign language datasets

Some samples of sign language datasets.
SIGNUM dataset
The SIGNUM database [84] contains isolated words and continuous sentences performed by various signers. To allow quick random access to a single frame, each video clip is stored as a series of images. The database includes 455 basic signs in German Sign Language, representing different word types. A total of 780 sentences are constructed from these signs, and the length of each sentence ranges from 2 to 11 signs. The corpus was performed by 25 signers of different genders and ages.
RWTH-Phoenix-Weather dataset
RWTH-Phoenix-Weather (2012) [89] is a corpus of daily news and weather forecasts recorded from the German public television station PHOENIX over three years (2009–2011). It consists of 386 subsets; all videos are recorded at 25 frames per second with a frame size of 210×260 pixels and are performed by 7 different signers. The entire corpus consists of 5,356 sentences, 45,760 running glosses, and about 600k frames, and has a vocabulary of about 1,200 signs.
CSL dataset
The Chinese Sign Language (CSL) dataset [90] has been collected by the University of Science and Technology of China (USTC) with Kinect devices since 2015. It contains 25K video examples, performed by 50 signers, and provides RGB, depth, and skeleton joint data. There are 500 kinds of words, 100 sentences, and a total of 5,000 videos. Each sentence contains an average of 4–8 words. Each video example is annotated by a professional Chinese Sign Language teacher.
Montalbano dataset
Unlike the three standard sign language datasets above, Montalbano [92] is an isolated gesture dataset composed of a series of continuous dynamic gestures. The dataset contains 13,858 RGB-D gesture video samples. Each RGB-D video represents only one gesture, and there are 20 gesture classes in total. The gestures were performed by 20 operators.
ChaLearn IsoGD dataset
ChaLearn LAP IsoGD [93] is an extended version of the Montalbano dataset. The total length of the data is about 24 hours, including 13,858 gesture instances. Because gesture recognition methods generalize to sign language, ChaLearn LAP IsoGD is also used as a test dataset in sign language recognition research. It consists of 47,933 videos, including RGB and depth data. Each RGB-D video represents only one gesture, and the gestures were performed by 21 operators.
In recent sign language recognition research, gesture and isolated word recognition studies mainly use the ChaLearn IsoGD dataset. In contrast, continuous sign language recognition studies mainly use the RWTH-Phoenix-Weather dataset, the CSL dataset, the SIGNUM dataset, etc.
AUTSL dataset
AUTSL [95, 96] is a large multi-modal Turkish sign language dataset presented at CVPR in 2021. The dataset consists of 226 signs performed by 43 different signers, with 38,336 isolated sign video samples in total. The samples cover a wide variety of backgrounds recorded in indoor and outdoor environments. Moreover, the spatial positions and postures of the signers also vary across the recordings. Each sample is recorded with Microsoft Kinect v2 and contains color image (RGB), depth, and skeleton data modalities.
Evaluation criteria
The evaluation criteria of gesture and isolated word recognition
Currently, the recognition benchmark for gestures and isolated words is accuracy. Accuracy refers to the proportion of correctly identified samples among all samples, as shown in Formula (13):

\[
Accuracy = \frac{TP + TN}{P + N} \tag{13}
\]

where TP is the number of samples correctly classified as positive, TN is the number of samples correctly classified as negative, P is the number of positive samples, and N is the number of negative samples. Generally, the higher the accuracy, the better the recognition effect.
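For multi-class gesture or isolated word recognition, this measure reduces to the fraction of samples whose predicted label equals the ground-truth label. A minimal sketch, assuming labels are given as two equal-length Python lists (the function name is illustrative):

```python
def accuracy(predicted, reference):
    """Fraction of samples whose predicted label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one sample is required")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```

For example, three correct predictions out of four samples give an accuracy of 0.75.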
In continuous sign language recognition, the labels of the datasets are complex and varied. Some words must be replaced, deleted, or inserted to make the recognized sequence consistent with the label sequence. Related evaluation indicators for continuous sign language recognition include word error rate (WER), BLEU, METEOR, Rouge, Precision, CIDEr, Acc, etc. Table 6 summarizes the evaluation criteria used in continuous sign language recognition papers.
Analysis table of evaluation criteria for continuous Sign language recognition papers
a) Word Error Rate (WER): It evaluates the similarity between the predicted sequence and the reference sequence. It measures the minimum number of edits (insertions, deletions, replacements) needed to transform the predicted sequence into the reference sequence,

\[
WER = \frac{I + D + S}{L} \times 100\%
\]

where L is the total number of words in the reference sequence, and I, D, and S are the total numbers of inserted, deleted, and replaced words, respectively. Generally, the smaller the WER, the better the recognition effect.
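b) BLEU: It measures the degree of co-occurrence of n-grams in the candidate sequence and the reference sequence to determine the similarity between the two sentences. In its standard form,

\[
BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
\]

where $p_n$ is the modified n-gram precision, $w_n$ is the weight of the n-gram order (usually $1/N$), and

\[
BP = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases}
\]

BP is the penalty factor for candidates that are too short, with c the length of the candidate sequence and r the length of the reference sequence. Generally speaking, the higher the BLEU, the better the recognition effect.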
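c) METEOR: This metric was proposed by Lavie et al. after they discovered, in 2004, the significance of recall in evaluation metrics. The goal is to address some of the flaws inherent in the BLEU standard. Unlike BLEU, METEOR considers both precision and recall over the entire corpus and combines them into a single measurement. In its standard parameterized form, the calculation process is as follows:

\[
F_{mean} = \frac{P \cdot R}{\alpha P + (1 - \alpha) R}
\]

\[
Pen = \gamma \left(\frac{ch}{m}\right)^{\beta}
\]

\[
METEOR = (1 - Pen) \cdot F_{mean}
\]

where P and R are the unigram precision and recall, m is the number of matched unigrams, ch is the number of chunks (contiguous runs of matched words), α, β, and γ are the default parameters for evaluation, and Pen is the penalty parameter. Generally, the higher the METEOR score, the better.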
d) Precision: It refers to the ratio of correctly recognized sentences to the total number of sentences recognized as positive,

\[
Precision = \frac{TP}{TP + FP}
\]

where TP is the number of samples correctly classified as positive and FP is the number of samples incorrectly classified as positive. Generally, the higher the precision, the better the recognition effect.
e) Rouge: This metric is based on matching the longest common subsequence of two text units. It is a similarity measurement method based on recall, which focuses on the adequacy and faithfulness of the candidate with respect to the reference rather than on its fluency. Its calculation is almost the same as that of BLEU, but the n-grams are counted against the reference translation. Rouge is divided into four types: ROUGE-N, co-occurrence statistics based on n-grams; ROUGE-L, precision and recall statistics based on the longest common subsequence; ROUGE-W, precision and recall statistics based on the weighted longest common subsequence; and ROUGE-S, precision and recall statistics based on skip-bigram co-occurrence.
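f) CIDEr: It is a metric for image captioning proposed by Vedantam et al. at the 2015 Conference on Computer Vision and Pattern Recognition to evaluate the similarity between computer-generated sentences and human-written descriptive sentences. It computes the cosine similarity between the TF-IDF vectors of the candidate sentence and of each reference sentence. In its standard form, the calculation formula is as in (23):

\[
CIDEr_n(c_i, S_i) = \frac{1}{M} \sum_{j=1}^{M} \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert} \tag{23}
\]

where $g^n(\cdot)$ is the vector of TF-IDF weights of all n-grams of length n, whose k-th component is

\[
g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})} \log\left(\frac{|I|}{\sum_{I_p \in I} \min\left(1, \sum_{q} h_k(s_{pq})\right)}\right)
\]

Here k indexes the k-th n-gram in a sentence, $h_k(s_{ij})$ is the number of times that n-gram occurs in $s_{ij}$, $\Omega$ is the vocabulary of all n-grams, $I$ is the set of all pictures, $c_i$ is the machine translation corresponding to the i-th picture, $s_{ij}$ is the j-th of the manual translations corresponding to the i-th picture, $S_i$ is the set of those manual translations, and M is the number of human translations corresponding to each picture.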
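The TF-IDF weighting and cosine similarity behind CIDEr can be sketched for a single n-gram length as follows. This is a simplified illustration, not the official scorer: published implementations average over several n-gram lengths and apply further refinements that are omitted here. The function names and the data layout (one candidate token list and a list of reference token lists per item) are assumptions of this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of one tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cider_n(candidates, references, n=1):
    """Per-item CIDEr-style score for a single n-gram length.

    candidates: one token list per item.
    references: one list of reference token lists per item.
    """
    num_items = len(references)
    # Document frequency: in how many items' references each n-gram appears.
    df = Counter()
    for refs in references:
        seen = set()
        for ref in refs:
            seen.update(ngram_counts(ref, n))
        df.update(seen)

    def tfidf(tokens):
        counts = ngram_counts(tokens, n)
        total = sum(counts.values())
        # n-grams unseen in any reference are treated as having df = 1.
        return {
            g: (c / total) * math.log(num_items / max(df[g], 1))
            for g, c in counts.items()
        }

    def cosine(u, v):
        dot = sum(w * v.get(g, 0.0) for g, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

    scores = []
    for cand, refs in zip(candidates, references):
        g_c = tfidf(cand)
        scores.append(sum(cosine(g_c, tfidf(ref)) for ref in refs) / len(refs))
    return scores
```

The inverse document frequency term down-weights n-grams that occur in the references of most items, so a candidate is rewarded mainly for matching the words that distinguish its own references.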
As shown in Table 6, the most widely used evaluation standard in the continuous sign language recognition research of recent years is WER. However, in the last two years, the BLEU evaluation standard has also been widely used. Evaluation standards such as METEOR, Rouge, Precision, CIDEr, and Acc have also come into scholars’ view as a way to jointly evaluate the quality of a model.
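Since WER and BLEU are the two most widely used criteria, a minimal sketch of both is given below. It assumes sentences are lists of word or gloss tokens and a single reference per sentence; the BLEU function is an unsmoothed sentence-level variant with uniform weights. The function names and the example glosses are illustrative, not taken from any dataset.

```python
import math
from collections import Counter


def wer(reference, hypothesis):
    """Return (WER, substitutions, deletions, insertions) via edit distance."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # Each cell holds (total edits, substitutions, deletions, insertions).
    prev = [(j, 0, 0, j) for j in range(len(hypothesis) + 1)]
    for i in range(1, len(reference) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hypothesis) + 1):
            t, s, d, n = prev[j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                match = (t, s, d, n)
            else:
                match = (t + 1, s + 1, d, n)
            t, s, d, n = prev[j]
            delete = (t + 1, s, d + 1, n)
            t, s, d, n = cur[j - 1]
            insert = (t + 1, s, d, n + 1)
            cur.append(min(match, delete, insert))
        prev = cur
    total, s, d, n = prev[-1]
    return total / len(reference), s, d, n


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: one empty n-gram order zeroes the score
        log_sum += math.log(clipped / total) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)


reference = ["TOMORROW", "NORTH", "RAIN", "STRONG", "WIND"]
hypothesis = ["TOMORROW", "RAIN", "STRONG", "STORM", "WIND"]
rate, subs, dels, ins = wer(reference, hypothesis)
```

In this example the hypothesis can be turned into the reference with two edits, so the WER is 0.4. The same pair shares no 4-gram, which makes unsmoothed BLEU-4 zero; this is one reason corpus-level or smoothed BLEU is normally reported in practice.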
This article reviews the main directions and research status of sign language research around the world in recent years, introducing representative work on sign language recognition and the process of sign language recognition, and listing the currently popular sign language databases. It can be seen from this article that research on sign language recognition has achieved initial results, and many effective algorithms have been proposed over time. However, there are still some shortcomings, such as the single type of input, the inability to focus on manual and non-manual features at the same time, and the lack of key information in feature extraction. Future research should aim to propose sign language recognition models with higher accuracy that can be applied in real time in real life.
Sign language recognition has extensive application potential in computer vision, human-computer interaction, and other fields and has always been a research hotspot. The following aspects will be worthy of researchers’ attention going forward:
Cross-disciplinary and cross-domain sign language research: The sign language recognition task should be combined with different fields to build a more lightweight, accurate, and real-time human-computer interaction method. At the same time, the application of sample-based cross-domain knowledge transfer and model-based transfer methods in sign language research is also a direction that can be explored and extended.
Sign language recognition in complex backgrounds: The sign language datasets used by researchers are recorded in a fixed environment using relatively fixed equipment. However, real conditions for sign language are usually complicated: lighting may be too bright or too dark, the distance from the collection equipment may vary, and more non-sign movements may be present. Complex background factors lead to lower accuracy; therefore, the robustness of sign language recognition systems should be improved. Improving applicability while ensuring accuracy is a more challenging issue.
Flexibility and detail of sign language: Sign language uses different parts of the body, such as fingers, hands, arms, head, body, and facial expressions, all of which have an important influence on the semantic expression of sign language. Therefore, sign language recognition research remains focused on using richer information to recognize sign language and improve recognition accuracy.
Detection of sign language word boundaries: Word segmentation of continuous sign sentences is also a significant research trend.
Improvement to recognition algorithms: For continuous sentence recognition, which has strong temporal dependencies, hybrid network structures will be the mainstream in the future. The success of algorithms in related fields, such as NMT, will also promote the further development of continuous sign language recognition.
Acknowledgments
This work was supported by the Collaborative education project of industry university cooperation of the Ministry of Education under Grant 202102594001, the Key teaching reform projects of Tianjin University of Technology under Grant ZD21-17, National Natural Science Foundation of China (Grant No. 61806071), the Open Projects Program of National Laboratory of Pattern Recognition (Grant No. 201900043), the Sci-tech Research Projects of Higher Education of Hebei Province, China (Grant No. QN2019207), and the Tianjin Sci-tech development strategy research planning Projects (Grant No. 18ZLZXZF00660).
