Abstract
Sign language recognition is an important cross-modal means of bridging the communication gap between deaf and hearing people. Automatic Sign Language Recognition (ASLR) translates sign language gestures into text and spoken words. Most researchers focus on either manual or non-manual gestures in isolation; concurrent recognition of both is rare. Facial expressions and other body movements can improve recognition accuracy and convey the exact meaning of a sign. This paper proposes a Multimodal Sign Language Recognition (MM-SLR) framework that recognizes non-manual features based on facial expressions together with manual gestures, represented as hand movements in the spatio-temporal domain. The proposed architecture has three modules. First, a modified YOLOv5 architecture extracts faces and hands from videos as two regions of interest. Second, a refined C3D architecture extracts features from the hand and face regions, and the features of the two modalities are concatenated. Lastly, an LSTM network with attention-based sequential modules derives spatio-temporal descriptors for gesture classification. To validate the framework, we used three publicly available datasets: RWTH-PHOENIX-Weather-2014T, SILFA and PkSLMNM. Experimental results show that the MM-SLR framework outperformed the compared models on all datasets.
Introduction
Sign Language (SL) is a visual linguistic system with spatio-temporal gesture structures, used by deaf and mute people around the world. It is the sole means of communication among mute individuals, and between deaf or mute persons and hearing people [1]. According to World Health Organization (WHO) statistics, 7.5% of the world’s population, around 470 million people, is deaf or hard of hearing, and this figure is projected to reach around 900 million by 2050 [2]. Sign language is a complete natural language with its own linguistic properties and vocabulary. There are many sign languages worldwide, and there is no clear association between spoken and sign languages; even countries that share a spoken language may have different sign languages [3]. Similarly, a few countries have several sign languages, just as spoken languages have several dialects. Sign language can be divided into three main categories: fingerspelling, isolated static hand signs, and continuous signs, usually called dynamic signs, which may be a single sign, an alphabet letter, or a set of words forming sentences [4].
Sign language gestures can further be classified into two major classes: manual and non-manual. Manual gestures are hand signs, either static or dynamic. Non-manual gestures are body postures, head movements, shoulder shrugs and, above all, facial expressions. Sign language has a grammatical structure in which facial expressions assume specific grammatical and affective functions. Facial expressions differentiate lexical items and also participate in syntactic construction. In most countries, sign language is considered a minority language, and deaf individuals need a sign language interpreter [5]. The absence of interpreters typically creates a major barrier to communication.
Automatic Sign Language Recognition (ASLR) is becoming essential for translating sign language into text or spoken words. Recognizing non-manual gestures along with manual gestures can improve ASLR accuracy and deliver the exact meaning of a gesture [6]. Several studies have observed that ignoring facial expressions may lose the exact affective state or morpheme of the language, which fulfills the syntactic and pragmatic functions needed for correct understanding. Facial expression recognition in sign language understanding is more complex than typical affective expression computing, and it is a key challenge for the non-manual markers of dynamic sign language recognition. Facial expressions in sign language fall into six categories: reflective expressions of emotions, constructed actions, conversation regulators, grammatical markers, modifiers, and lexical mouthing [7]. These types are further described in Table 1.
Types of Facial Expression
This paper works towards recognizing affective facial expressions together with hand movement analysis in a spatio-temporal representation. Faces and hands are tracked across multiple frames and extracted as regions of interest (ROIs); their features are then concatenated and classified into seven classes representing disgust, neutral, happy, sad, scared, angry, and surprise expressions. The current study makes the following contributions: i) a hybrid manual and non-manual spatio-temporal multimodal deep learning framework targeting affective non-manual gestures; ii) a non-independent multi-stream model using an upgraded YOLOv5 architecture, which extracts hand and face regions independently for consistent fine-grained feature mining and can represent the relationship between the two local ROIs; iii) feature extraction and concatenation using a refined C3D architecture; and iv) an LSTM architecture to classify the multimodal spatio-temporal representation of gestures under low-resource conditions.
The rest of this paper is organized as follows. Section 1 introduces dynamic sign language recognition using manual and non-manual modalities. Section 2 elaborates on state-of-the-art studies related to hybrid multimodal sign language recognition and its foundations. Section 3 discusses the methodology of the proposed MM-SLR architecture in three modules, namely YOLOv5, the refined C3D architecture and the LSTM model, along with the datasets. Section 4 discusses the experimental results and the implications of our findings. Section 5 outlines the conclusion.
Sign language recognition with manual gestures has been studied for many years [8, 9]. Similarly, Facial Expression Recognition (FER) has been studied for decades in many domains [10, 11]. Emotion plays a significant role in natural communication. In affective computing, emotion recognition from facial expressions, eye movements and body gestures is central to perceiving, translating and processing human expressions. Affective computing plays an important role in intelligent systems, online educational platforms, stress analysis, sentiment analysis, and so on. Another important domain of affective facial expression recognition is Sign Language (SL), where it is formally known as non-manual sign language recognition [12]. Non-manual sign language recognition is a rapidly growing field of computer vision. Sign language is the natural language of individuals with low or no hearing; it conveys signs through hand movements combined with facial expressions [13]. Despite the great progress in intelligent systems in recent years, emotion recognition remains one of the most pressing problems in human interaction. Affective emotions are challenging because they require understanding contextual and psychological information [14]. Nevertheless, several researchers have tried to recognize facial expressions in videos using computer vision technologies, while numerous others focus on hand gestures, formally known as manual sign language [15, 16]. In sign language specifically, however, there is little literature that associates non-manual gestures with manual sign language understanding.
To fill this gap, the study in [17] provided a framework combining manual and non-manual gestures. It used a Conditional Random Field (CRF) and BoostMap embedding for manual gesture segmentation. For non-manual gestures, it used an Active Appearance Model (AAM) for facial feature extraction and classified the gestures with an SVM classifier. This study successfully combined manual and non-manual features of sign language. It also distinguished signs, fingerspelling signs, and non-sign patterns, and it is robust to varying sizes, rotations, and left or right hands, with an accuracy of 84%.
Subsequently, a sensor-based approach was adopted in [18], which built a sign language recognition system that considers facial expressions together with hand signs, using a Leap Motion sensor for hand signs and a Kinect device for facial gestures. On self-collected data of 51 dynamic word gestures, recognition was performed using a Hidden Markov Model and a Bayesian Classification Combination approach to classify the complete gesture from hand signs and facial features. The recognition rates were 96.05% for single-hand and 94.27% for two-hand gestures.
In another study, Keisuke and his fellow researchers developed a facial expression discriminant analysis method with a fuzzy inference classifier, using a Kinect sensor for facial expression analysis and gloves fitted with flex sensors and accelerometers to analyze hand gestures. Working with 4 gestures, they achieved an accuracy of 90.7%. Their main aim was to read a teacher’s sign language and facial expressions and automatically adjust some parameters of a computer-based system [19].
In other research, a facial-expression-based sign language architecture was proposed for Brazilian Sign Language, in which facial action units are recognized using the Facial Action Coding System (FACS). To classify the seven basic expressions of neutral, happy, sad, disgust, fear, surprise, and anger, a CNN+LSTM achieved an average F1-score of 0.87 [20]; the authors also extended their studies to several variations of facial action units and several deep learning architectures [21]. Neural networks are increasingly popular in recent research, and [22] worked on neural sign language translation to bridge the cross-modal gap between signers and hearing people. They proposed two schemes. In the first, a multi-stream architecture aggregates facial expression information with the mainstream information. In the second, facial features are extracted as a Region of Interest (RoI) using a pre-trained architecture, and classification techniques are then used to translate sign language. The publicly available benchmark dataset RWTH-PHOENIX-Weather-2014T was used to validate the model. Experimental results show improved performance over the state of the art.
Similarly, the recent studies [23] and [24] show that deep CNN architectures excel at multimodal manual and non-manual gesture recognition when using devices such as Kinect, but with RGB data, or data without depth, the accuracy drops from 90% and above to 80 to 85%.
However, an in-depth literature review shows that one of the great challenges of sign language recognition is handling manual and non-manual parameters simultaneously within a single framework or methodology, particularly for dynamic sign language with its unique grammatical structure. Very limited literature exists on combined frameworks for hand gesture and facial expression detection, and all of it uses two different devices or sensors along with two separate modules for manual and non-manual gestures. A framework that recognizes manual and non-manual gestures simultaneously with a single vision-based camera is required to work in a specific place within a given amount of time while respecting this grammatical structure. Table 2 summarizes recent research on dynamic manual and non-manual sign language recognition systems and their experiments.
Manual and Non-Manual hybrid models literature summary
Our proposed methodology has three modules: i) a modified YOLOv5 for extracting face and hand ROIs from the videos; ii) a refined C3D architecture for multimodal feature extraction followed by concatenation; and iii) an LSTM network with attention-based sequential modules to obtain spatial and temporal descriptors for classification. The modified YOLOv5 acts as a preprocessing unit: the input video contains many features, but we are interested only in the facial features as the non-manual sign language descriptor and the hand features as the manual sign language descriptor. The modified YOLOv5 architecture detects the face and hand ROIs, and the rest of the information is discarded. Features of the hand and face are extracted by the C3D model, which retrieves 3D attributes and deeper information from frames. These face and hand ROI features are then concatenated. Finally, the concatenated long-range features are classified by the LSTM model, which learns the temporal connections between video frames. Combining C3D and LSTM fuses temporal and spatial characteristics, which improves classification accuracy. The architecture of the complete system is shown in Fig. 1.

System diagram.
A one-stage detector has three core components: a backbone, a neck and a head. The backbone is a convolutional neural network that extracts features of the input image at different scales. The neck layers combine the features extracted by the backbone according to predetermined rules to enrich their semantic information. YOLO acts as a one-stage detector: it treats object detection as a regression problem, so the model needs only a single pass over the entire input image through an end-to-end neural network. After splitting the image into a grid, the network predicts rectangular boxes for the grid cells. Each cell determines the score and location only of the boxes whose centroid lies within it. For each rectangular box, a 5D output consisting of the box coordinates and a confidence level is produced. This fast inference technique comes at some cost in accuracy. On the other hand, this version of YOLO has several advantages, including faster inference, a simplified architecture and a user-friendly interface.
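The grid assignment and the 5D output described above can be sketched as follows. This is a minimal illustration of the description in the text, not YOLOv5's actual label-assignment code; the function names `responsible_cell` and `encode_box`, the 640 × 640 image size and the 20 × 20 grid are all assumptions chosen for the example.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, img_w: int, img_h: int,
                     grid: int) -> Tuple[int, int]:
    """Return (row, col) of the grid cell containing the box centre (pixels)."""
    col = min(int(cx / img_w * grid), grid - 1)  # clamp centres on the right edge
    row = min(int(cy / img_h * grid), grid - 1)  # clamp centres on the bottom edge
    return row, col


def encode_box(cx: float, cy: float, w: float, h: float, conf: float,
               img_w: int, img_h: int) -> Tuple[float, float, float, float, float]:
    """Normalise a pixel box to the 5D output (x, y, w, h, confidence)."""
    return (cx / img_w, cy / img_h, w / img_w, h / img_h, conf)


# Hypothetical hand box centred at (320, 100) in an assumed 640x640 frame.
cell = responsible_cell(320.0, 100.0, 640, 640, grid=20)
box = encode_box(320.0, 100.0, 64.0, 48.0, 0.9, 640, 640)
print(cell, box)
```

Only the cell returned by `responsible_cell` would score this box, which matches the rule that a cell handles the boxes whose centroid lies within it.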
With key modules such as Focus, CSP bottleneck, SPP, and PANet, YOLOv5 offers a number of benefits in speed and precision. The backbone is CSPDarknet53, which is built from several residual networks stacked on top of one another. Mish acts as the activation function for the convolutional layers, and each residual block body passes through one downsampling step followed by several residual units. YOLOv5 uses PANet as the neck of the structure, which takes the feature map output by the backbone as its input. After this feature map is fused, features with richer semantic information are created and supplied to the head network for detection. PANet improves feature extraction over FPN, which provides semantic as well as positional information: a bottom-up pyramid is added to the FPN architecture to transfer strong positional features from the lower levels to the higher layers, supplementing the FPN feature fusion [27]. The architecture of YOLOv5 is shown in Fig. 2.

YOLOV5 architecture [27].
YOLO has gained a lot of attention in object detection recently. YOLOv5 downsamples in multiples of 8 to detect objects of various sizes and shapes and to acquire more feature information. As a result, much positional information is lost, which makes small objects hard to find. YOLOv5 therefore generates three outputs by downsampling at different levels, and information from these three scales is merged into multiscale information that accounts for both the semantic and the positional information of small targets. Each output contains distinctive semantic and positional information.
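To make the loss of positional detail concrete, the sketch below computes the grid resolution of the three output scales and how many grid cells a small object spans at each. The strides 8, 16 and 32 are the usual YOLOv5 values; the 640-pixel input and the 16-pixel object are assumptions chosen for illustration, not values reported in this study.

```python
from typing import Dict, Tuple


def grid_sizes(input_size: int, strides: Tuple[int, ...] = (8, 16, 32)) -> Dict[int, int]:
    """Map each downsampling stride to the side length of its output grid."""
    return {s: input_size // s for s in strides}


def cells_spanned(object_px: float, stride: int) -> float:
    """Number of grid cells an object of the given pixel width covers."""
    return object_px / stride


sizes = grid_sizes(640)                              # assumed 640x640 input
spans = {s: cells_spanned(16, s) for s in sizes}     # assumed 16-pixel-wide hand
print(sizes)
print(spans)
```

At the coarsest scale the assumed object covers only half a cell, which is why fusing the finer scales matters for small hands and faces.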
The proposed model takes into account that most of the target objects in the dataset may be small. Upsampling and downsampling are performed on y1 and y3 before they are characterized along the channel dimension. To improve the extracted features, an attention module is embedded, and a GC block is used to strengthen the network further. Feature fusion is advantageous here because it captures cross-channel dependencies. After the upsampling and downsampling operations, y1’, y2’ and y3’ are generated as the new outputs, as shown in Fig. 3.

Improved module in YOLOv5.
C3D can model appearance and motion information simultaneously, and it can be trained on large datasets. C3D extends 2D convolutional networks by also learning along the temporal direction of consecutive RGB frames; without additional preprocessing steps, it learns temporal relations and discriminative features at the same time. The trainable features of the C3D network automatically capture the spatial and temporal characteristics of the input data for classification [28].
Therefore, the proposed work uses the C3D approach to extract features from the input video. The 3D kernel and pooling sizes are given as d × k × k, where d is the temporal depth of the kernel and k is its spatial size. The computations of the 3D convolution layer, the Leaky ReLU activation layer and the 3D max pooling layer are explained below.
A 3D convolution is performed by stacking several adjacent frames into a cube and convolving it with a 3D kernel. The feature maps are generated in the same way, so each feature map in a layer is connected to several consecutive frames of the previous layer. Specifically, the value at position (x, y, z) on the jth feature map of the ith layer is determined by Equation (1).
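One standard way of writing this 3D convolution uses the symbols P_i, Q_i, R_i and m that the text defines. The bias b_{ij}, the weights w and the activation function f are assumed names added so that the expression is complete; they are not notation confirmed by the source.

```latex
v_{ij}^{xyz} = f\!\left( b_{ij}
  + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1}
    w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \right)
```

In this form, $w_{ijm}^{pqr}$ is the kernel weight at offset $(p, q, r)$ that connects the $m$th feature map of layer $i-1$ to the $j$th feature map of layer $i$.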
Here, the kernel size of the 3D convolution is denoted by P_i, Q_i and R_i, m indexes the feature maps of the previous layer (i − 1), and the kth feature map value is
Leaky ReLU applies a piecewise linear function to the weighted sum of the inputs: a positive input is returned unchanged, while a negative input is scaled by a small slope instead of being set to zero. Leaky ReLU is applied in the ConvNet to introduce nonlinearity. The 3D max pooling layer increases the robustness of the 3D CNN model by giving the cube features a degree of invariance along the time dimension. The max pooling formula is given in Equation (2):
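A common form of the 3D max pooling operation is sketched below with the symbols u, v, S1, S2, S3 and (s, t, r) used in the text. The window indices i, j and k are assumed names introduced for the sketch.

```latex
v_{x,y,z} = \max_{\substack{0 \le i < S_1 \\ 0 \le j < S_2 \\ 0 \le k < S_3}}
  u_{\,x \cdot s + i,\; y \cdot t + j,\; z \cdot r + k}
```

Each output value is the maximum over one $S_1 \times S_2 \times S_3$ window of the input, and consecutive windows are offset by $(s, t, r)$ along the three dimensions.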
Here u and v are the input and output vectors of the 3D pooling operation. The sampling size is S1 × S2 × S3, and the sampling stride in the three dimensions (x, y, z) is (s, t, r). The proposed C3D module consists of five blocks, as shown in Fig. 4. Block 1 and block 2 each consist of a 3D convolutional layer, a Leaky ReLU function and a pooling layer; the remaining blocks contain two of each: activation function, 3D pooling and convolutional layer. Across the five blocks, the number of filters ranges from 64 to 512. The proposed model uses a kernel size of 3×3×3 during convolution, which helps capture differences in spatial and temporal features. These five blocks together classify different behaviors and categorize visual content at various levels.

Proposed C3D feature extractor for face ROIs.
3D max pooling was applied for pooling. To preserve the temporal information and avoid blending the temporal signals, the pooling layer uses a 3×3×3 kernel. The same order was kept in the last layers, which reduced the 3D CNN features tenfold. Feature maps of equal size are kept when higher and lower spatial resolutions are compared. The input frames are transformed into feature vectors, which capture motion information from the input videos through several convolutional layers and subsampling. The output of each convolutional layer can be regarded as a set of spatio-temporal features. Higher layers build more semantic and discriminative features on top of the low-level features, such as colors and edges, produced by lower layers. Of the five convolutional layers, the last has the largest receptive fields and yields invariant and discriminative features. Using this mechanism, features are extracted from the face ROIs and the hand ROIs and then concatenated.
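The concatenation step can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: it supposes one feature vector per clip for each modality, places the face features before the hand features, and uses toy vectors far shorter than real C3D outputs. The name `fuse_clip_features` is hypothetical.

```python
from typing import List, Sequence


def fuse_clip_features(face: Sequence[Sequence[float]],
                       hand: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate per-clip face and hand feature vectors, clip by clip."""
    if len(face) != len(hand):
        raise ValueError("face and hand streams must cover the same clips")
    # Assumed ordering: face features first, then hand features.
    return [list(f) + list(h) for f, h in zip(face, hand)]


# Toy features for two clips: 2-D face vectors and 3-D hand vectors.
face_feats = [[0.1, 0.2], [0.3, 0.4]]
hand_feats = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fused = fuse_clip_features(face_feats, hand_feats)
print(fused)
```

The fused sequence keeps one vector per clip, so it can be passed to a sequence model in temporal order.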
C3D can successfully learn low-level temporal and spatial features, so an LSTM is used to model the high-level temporal attributes. In practice, the softmax classifier is dropped from the classic C3D model, and the features are fed to the last 1×1×1 layer of the LSTM model. Figure 5 shows the architecture of the LSTM.

LSTM network.
Here the input signal, the cell state and the hidden state are represented by “x”, “c” and “h”, where σ and tanh are the two activation functions. The forget gate, input gate and output gate keep the state of the LSTM cell updated by retaining the desired information from the history and removing the rest. Equation (3) shows the calculation process.
In Equation (3), the weight matrix and bias are represented by W_g and b_g. The input state x_t at the current timestep is combined with the hidden state h_{t-1} of the previous timestep. The output of the σ function ranges from 0 to 1, so the forget gate can remove or keep the previous state. The input gate is given separately by Equation (5), and the process of updating the cell is shown in Equation (6).
Equation (7) and Equation (8) exhibit the formation of the output gate.
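The gate computations described above can be sketched as one step of a standard LSTM cell in plain Python. This follows the textbook formulation and is an assumption about the form of the paper's equations, not a reproduction of them. The dictionary keys (`forget`, `input`, `cand`, `output`), the helper `affine` and the toy sizes are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W v + b for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              W: Dict[str, Matrix], b: Dict[str, Vector]) -> Tuple[Vector, Vector]:
    """One LSTM timestep; each W[k] has shape hidden x (hidden + input)."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(W["forget"], b["forget"], z)]   # forget gate
    i = [sigmoid(a) for a in affine(W["input"], b["input"], z)]     # input gate
    g = [math.tanh(a) for a in affine(W["cand"], b["cand"], z)]     # candidate state
    o = [sigmoid(a) for a in affine(W["output"], b["output"], z)]   # output gate
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


# Toy check: hidden size 2, input size 3, all weights and biases zero.
hidden, inputs = 2, 3
gates = ("forget", "input", "cand", "output")
W = {k: [[0.0] * (hidden + inputs) for _ in range(hidden)] for k in gates}
b = {k: [0.0] * hidden for k in gates}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], W, b)
print(h_t, c_t)
```

With zero weights every sigmoid gate outputs 0.5 and the candidate state is 0, so the cell state is simply halved; this makes the role of the forget gate easy to see.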
Datasets
Manual and non-manual sign language recognition has not yet been fully explored. Non-manual gestures, specifically facial expressions, play an important part in emotion detection and have therefore gained a lot of attention. The proposed study used three datasets to train and evaluate the model: PkSLMNM [29], RWTH-PHOENIX-Weather-2014T [22, 30] and SILFA [31], all of which are publicly available. Frames were extracted from all datasets and then annotated for training and testing. Both C3D and LSTM were trained on the extracted and annotated frames of each dataset individually because, as discussed in Section 1, affective non-manual expressions are similar across sign languages, whereas manual signs differ between languages.
Implementation details
The implementation uses the PyTorch deep learning framework. The model was run on Ubuntu 20.04 with an NVIDIA GeForce GTX 1080Ti GPU, an Intel(R) Core(TM) i7-3770 CPU at 3.40 GHz and 64 GB of RAM. The improved YOLOv5 is trained end-to-end with stochastic gradient descent. The model weights are updated at every iteration, with a batch size of 64. Training runs for 400 epochs, with the decay rate and momentum set to 0.0005 and 0.910, respectively.
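A single SGD update with momentum can be sketched as below, using the common convention in which the decay term is added to the gradient before the momentum buffer is updated. Treating the reported 0.0005 as a weight decay coefficient is an assumption, and the learning rate of 0.01 is a placeholder because the text does not report one. The function name `sgd_momentum_step` is hypothetical.

```python
from typing import Tuple


def sgd_momentum_step(w: float, grad: float, buf: float, lr: float,
                      momentum: float = 0.910,
                      weight_decay: float = 0.0005) -> Tuple[float, float]:
    """Apply one SGD-with-momentum update to a single scalar weight."""
    g = grad + weight_decay * w   # decay pulls the weight towards zero
    buf = momentum * buf + g      # running velocity of past gradients
    w = w - lr * buf              # step against the velocity
    return w, buf


# Placeholder values: initial weight 1.0, gradient 0.5, assumed lr 0.01.
w, buf = sgd_momentum_step(w=1.0, grad=0.5, buf=0.0, lr=0.01)
print(w, buf)
```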
Evaluation and results
Fig. 6 shows the training and validation loss curves. The loss drops sharply during the first iterations and stabilizes at 0.3426 after 2000 iterations.

Error loss of improved YOLOV5.
The proposed model is evaluated quantitatively using precision and recall, which are calculated as given in Equation (10).
Here, True Positives (TP) and False Positives (FP) are the correctly and wrongly detected objects, respectively, whereas False Negatives (FN) are the correct objects that were not detected. The F1 score, given in Equation (11), is the harmonic mean of precision and recall and is one of the important criteria for evaluating the performance of the model.
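These metrics follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them from detection counts; the counts are hypothetical and are not results from this study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f1)
```

The zero-denominator guards return 0.0 when a class has no detections or no ground-truth objects, which is a design choice of this sketch.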
To verify the performance of the proposed model, it was tested on the different datasets. Table 3 shows the precision, recall and F1 score of the proposed model on multiple datasets.
Results of the improved YOLOv5
An LSTM is used for video classification. To train and validate the model, 250 epochs are run on the PkSLMNM dataset as well as on the others. The model was trained on the features extracted by the proposed C3D. The training and validation accuracy is shown in Fig. 7.

LSTM training and validation accuracy using PkSLMNM dataset.
The proposed model achieves a training accuracy of 83% and a validation accuracy of 79%, as shown in Fig. 7. Compared with the traditional model, it showed higher test accuracy, reduced the training time twofold, occupied less GPU memory and showed no over-fitting when the number of epochs was increased, which indicates that it outperformed the traditional model, as reported in Table 4. Sample results of the model are shown in Fig. 8.

Sample results on proposed model.
Accuracy comparison of different models
Table 4 compares the models by their performance, the computation required in giga floating-point operations (GFLOPs) and the number of parameters. The results show that the proposed model outperformed the others. The confusion matrix is plotted as a 2D map in Fig. 9. Twenty videos were used for validation, for which the model’s true positive, false negative, false positive and true negative counts were identified.
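A confusion matrix of this kind, and the per-class TP, FP, FN and TN counts derived from it, can be sketched as follows. The class list follows the seven expressions named earlier in the paper; the four labelled videos are hypothetical and do not reproduce Fig. 9.

```python
from typing import List, Sequence, Tuple

CLASSES = ["disgust", "neutral", "happy", "sad", "scared", "angry", "surprise"]


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_counts(matrix: List[List[int]], k: int) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for class index k."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                      # true class k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp       # predicted k, true class otherwise
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return tp, fp, fn, tn


# Hypothetical labels for four validation videos.
y_true = ["happy", "happy", "sad", "angry"]
y_pred = ["happy", "sad", "sad", "angry"]
cm = confusion_matrix(y_true, y_pred, CLASSES)
happy_counts = per_class_counts(cm, CLASSES.index("happy"))
sad_counts = per_class_counts(cm, CLASSES.index("sad"))
print(happy_counts, sad_counts)
```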

Confusion matrix.
In this research, three modified architectures are combined into a novel hybrid architecture, MM-SLR, which recognizes non-manual features based on facial expressions together with manual gestures, represented as hand movements in the spatio-temporal domain, for automatic sign language recognition. Experiments were conducted on three public sign language datasets, and the results were analyzed from multiple aspects and at multiple levels. The average loss of the modified architecture is 0.34, and it is stable after 2000 iterations. Quantitative analysis in terms of precision, recall and F1 score was performed for all three datasets, and the model performs promisingly on all of them. Overall, our model classifies gestures from manual and non-manual features using the LSTM architecture; for the PkSLMNM dataset, the training and validation accuracies are 83% and 79%, respectively. Moreover, an auxiliary experiment on human pose estimation in the affective domain was conducted to assess how well our model generalizes when combining manual and non-manual gestures. As a future direction, we can update and evaluate our model with independently extracted non-manual features concatenated with global features, which would give prominence to hand movements as well as body movements, an important aspect of sign language understanding.
Funding statement
The authors received no specific funding for this study.
Conflicts of interest
The authors declare that they have no conflicts of interest to report regarding the present study.
Authors contribution
Conceptualization, S.J., and S.R.; Methodology, S.J., and S.R.; Software, S.J.; Validation, S.J. and S.R.; Formal Analysis, S.J.; Investigation, S.J. and S.R.; Resources, S.J. and S.R.; Data Curation, S.J.; Writing—original draft preparation, S.J.; Writing—review and editing, S.R.; Visualization, S.J.; Supervision, S.R.
