English sign language corpus with real-time mobile recognition and NLP for phrase interpretation

Abstract

Background

Recognizing sign language has emerged as a crucial method for closing communication barriers between the general public and hearing and speech-impaired individuals.

Purpose

In order to facilitate instantaneous recognition in mobile apps, this article presents an extensive collection of sign language in English.

Research Design

In contrast to earlier datasets, our collection includes a large number of signers, a wide range of hand gestures, and a diversity of environmental variables, guaranteeing reliable performance in a range of real-world situations.

Data Analysis

Our method improves real-time language of signs recognition and transforms gestures to speech with low latency by utilizing portable optimization approaches. High-resolution data, a specific English dataset, and dataset variety all work together to greatly enhance model generalization by eliminating errors imposed on by changes in user demographics, illumination, and hand placement as well as phrase interpretation using Natural Language Processing model.

Conclusions

By creating more accessible and inclusive mobile applications, our research opens the door for smooth sign-to-speech translation across linguistic and cultural borders.

Keywords

sign language recognition sign-to-speech conversion customized dataset NLP

Introduction

People with hearing difficulties can communicate with others in everyday life by using sign language, which is an essential communication tool. The absence of effective means of communication between non-signers and sign language users is still a major obstacle, nevertheless. Real-time conversion into spoken language is now feasible because of recent advances in computer vision and machine learning, which have automated the identification of sign language. In spite of the advancements in this field, a large number of sign language recognition databases now in use are lacking for the distinctive characteristics of various sign languages, especially geographical variances within languages and variations in gestures. In this work, a foundational dataset of English sign language that is tailored for immediate recognition within mobile applications is developed so as to address this issue.

By supplementing the English dataset, this one makes sure each of the languages are adequately represented for usage across the world. To improve the robustness of identification models, the dataset is constructed with a variety of hand gestures, various signers, and a range of ambient situations. The suggested system can reliably identify signs in real time by fusing advanced machine learning methods with high-resolution images and videos, thereby enabling smooth sign-to-speech interpretation. By making mobile applications accessible and useful for real-world applications, this research seeks to increase accessibility for users of English sign language.

Related work

Systems for recognizing sign language (SLR) have become essential assistive technology, facilitating smooth communication between hearing and deaf/hard-of-hearing people. In Ref. 1, input is provided as a sign language video, which is then framed and segmented. The Haar Cascade algorithm is utilized for sign detection, the CamShift approach is used for tracking, and P2DHMM is used for hand tracking. Following sign recognition, the POS tagging module receives the continuous words associated with the appropriate sign. The problem of hand overlap, which might compromise the precision of sign recognition, is not addressed by the system. Participants in the study are hearing-impaired students from Shankar Nagar Nagpur school. The system received 60 films as input, and 33 model feature expression samples were built into a database. The study aims to create NLP algorithms that can handle lengthy phrases; however, it ignores the vast database that will cover the most terms. Ref. 2 framework combines The Canny edges detection, region growth, SURF descriptors, clustering with K-means, and SVM classification to identify 16 classes of the American Sign Language alphabet with a mean accuracy of 97.13%. According to the study, a smartphone platform’s processing capability may be a limiting factor in its utilization. According to the study,³ the MobileNet method could find value in additional fields including object identification and picture categorization. The work had real-world implications for creating an algorithm that can efficiently categorize various sign languages. Although it did not involve NLP, the research also demonstrated the model’s potential for usage in practical settings. The report makes a number of recommendations for future research, such as investigating other deep learning techniques and using bigger datasets. Available estimates of human landmarks and hand tracking were made achievable by the usage of MediaPipe, in the paper,⁴ which also introduces the King Saud University Saudi-SSL (KSU-SSL) database, a Saudi Sign Language database, and employs a 3DGCN architecture for sign language recognition. The study⁵ seeks to address the difficulties in recognizing sign language, such as the deficiency of posture annotation and the requirement for efficient pre-training and fine-tuning techniques. The study has some drawbacks, such as the requirement for a significant quantity of labeled data and its dependence on pre-trained posture detectors. The study also points out that the posture detection quality may have an impact on the method’s performance. The work may have restrictions if the suggested approach is applied to other fields, such as computer vision or natural language processing. According to the findings of the study,⁶ existing video synthesis techniques lack the accuracy necessary for a fine-grained comprehension of the entire sign language phrase, but they do permit a certain level of understanding, such as how to identify the video category. It is used with the How2Sign dataset, which consists of a number of videos in sign language. The dataset’s restriction to American Sign Language as well as potential lack of generalizability to other sign languages are among the study’s shortcomings. The study also points out that gathering and annotating sign language data can be a costly and time-consuming process. There are real-world uses for the study⁷ in language instruction and education, especially when it comes to learning sign language. According to the study, augmented reality-enabled mobile applications may be utilized to foster social understanding and awareness while offering inclusive and accessible learning opportunities for those with hearing impairments. Due to the small sample amount of thirty users and particular setting, the research could not be entirely typical of the general population. The suggested model has an extensive computational complexity, according to the study.⁸ The quantity of the dataset is one of the study’s shortcomings. The intricacy of the model is another drawback of the study. Among the study’s shortcomings is the requirement for additional data to enhance the model’s functionality. It is necessary to solve the transformer constraints and computational complexity of the model. Using a trained convolutional neural network (CNN), the study⁹ suggests a Rotation, Translation, along with Scale-invariant sign word-recognition system. However, the study notes that because of the CNN model’s computational complexity, the suggested system might not be appropriate for real-time applications. The study’s shortcomings include its inability to handle RTS variable pictures and complicated backdrops, which impair system performance. The efficiency provided by the MediaPipe framework in hand gesture detection is shown in the paper.¹⁰ Without the use of wearable sensors, the system can instantly identify motions in sign language. The constraints of the suggested system are not thoroughly covered in the paper. The need for good-quality sign language datasets along with the possible complexity of the suggested model are two of the study’s shortcomings. In order to maximize feature representation, the study¹¹ suggests an Adaptive Transformer architecture that combines three innovative modules: Adapted Masking, Local Clipping Self-Attention, with Adaptive Fusion. Future research will examine how well ADTR generalizes across other sign languages, improves spatial learning for representation, and adapts to new sign language datasets and languages. By creating automatic systems for translating among spoken and signed languages, the work in Ref. 12 aims to enhance communication between the Deaf and hearing populations using noninvasive technology. The goal of the project is to solve the difficulties in using avatars to show and convey signed language. However, future research in this field will focus on creating increasingly complex avatar technology and more sophisticated methods for conveying signed language. The proposed system offers Sign Language Recognition for a single sign to a whole phrase as well as offers interesting different learning methodologies through a catchy mobile application.

Methodology

The proposed mobile application includes an efficient text-to-sign and sign-to-text using sign language for an e-learning purpose that has been upgraded with fascinating augmented reality visuals for sign language that also offers real-time practice choices. Figure 1 shows the general data flow of the application.

Figure 1.

Application data flow diagram.

Public datasets frequently don’t cover specialist topics or use cases, have quality problems like missing numbers or imbalances, and lack domain-specific data. Building a customized dataset gives users greater flexibility over the quality of the data, ensuring that it satisfies specific requirements and enhances the precision and dependability of your model.

Dataset

The system is developed using a methodical, systematic process to guarantee accuracy and efficiency. To start, a complete dataset is created by collecting and annotating sign language motions. To make model training easier, each move is labeled and classified. Using the data with annotations, machine learning models are then trained to effectively detect and categorize sign language motions.

Approach 1

Using MediaPipe¹³ as the selection of features tool by which saves and captures hand landmarks for every sign as in Figure 2, then traditional machine learning algorithm can be trained and tested for validation purpose, the following steps in Figure 3 describe how to create a personalized, high-resolution datasets.

Figure 2.

MediaPipe hand landmark.

Figure 3.

Custom numbers dataset samples.

The following steps are for building a customized sign language dataset using approach 1:

• Begin collecting data.

• Verify the availability of the data directory.

• If false, then Create a dataset directory.

• If true, then continue iterating over every sign class.

• Examine the class’s current images.

• Proceed to the following class if the dataset size is achieved.

• Encourage the user to begin recording

• Take images and store them for each class.

• Image processing using MediaPipe

• Find landmarks and detect hands.

• Store and normalize landmark data.

• Repeat the above steps for every class.

• End Data Gathering

The dataset consists of two parts: the first includes five ML models for gestures such as “Hello,” “I Love You,” “Yes,” “No,” “Like,” “DisLike,” and “GoodBye,” with 420 files organized into 7 folders, each containing 60 samples, totaling 17.8 MB.

The second part involves BiLSTM¹⁴ and Bidirectional GRU¹⁵ models using MediaPipe for numbers 0–10, with 2200 files in 11 folders, each containing 200 samples, and a total size of 120 MB. Hand landmarks tagged with signs are used to offline train the MediaPipe ML model. The model uses MediaPipe to identify hand landmarks during runtime, then extractions and normalized the landmark points for creating a feature vector (making sure that the model’s place within the frame has no impact on predictions). After predicting the matching gesture, the model uses a labels dictionary to translate its prediction into a word. Last but not least, the word is uttered and shown in real time. For a dataset of numbers 0–10 as in Figure 3, the BiLSTM and Bidirectional GRU models with MediaPipe come with 2200 files arranged in 11 folders, each of which has 200 samples. The dataset has a total size of 120 MB.

To capture the essential joint movement and spatial relationships of each motion, the datasets employ hand landmarks with the help of ten preprocessed and standardized signers. This scale-invariant format allows for the efficient training of deep recurrent architectures with a limited number of samples by telling the model to learn the basic structure of the signs while ignoring superfluous background or positional variation. Consistently collected and balanced by class, the data promotes steady convergence and reduces intra-class variability. This approach is inherently extensible since the same landmark recognition pipeline can be used to instantly gather data for more signs and users. Therefore, successful performance on this monitored dataset points to a strong, widely applicable foundation for an expanding sign collection in the future.

Approach 2

Using deep learning techniques, including Convolutional Neural Network and the related models, to extract features from images with and without MediaPipe features is an alternate strategy. The purpose of these models is to automatically extract complex information from the input images. To ensure that the dataset is adequately represented and that the model can effectively generalize to new data, its extracted features are subsequently verified through training and testing.

The depicted numbers range from 0 to 10. While, the dataset for the hybrid VGG16¹⁶ with BiLSTM model, is used without MediaPipe, includes gestures for letters A-Z as in Figure 4 and a blank (no hand gesture). It consists of 8426 files organized into 27 folders, with each folder containing 301 samples, except for the blank gesture folder, which has 600 samples. The total size of this dataset is 90.5 MB. The generalization capacity of our dataset is enhanced by its acceptable signal-to-noise ratio, which is the result of well-organized data collection with the help of 10 signers that eliminates environmental volatility. This enables efficient acquisition of fundamental gesture patterns using a limited quantity of examples. Further enhancing generalization, the pre-trained weights of VGG16 provide transferred feature extraction for hand forms, while the BiLSTM layer depicts temporal relationships across frames. This hybrid approach combines spatial feature transfer and temporal pattern learning to enable the model to learn powerful recognition abilities. Therefore, its effectiveness on the manipulated dataset confirms that it can generalize even with the current sample size by showcasing basic understanding that can expand to larger vocabulary and real-world contexts.

Figure 4.

Custom letters dataset samples.

The steps below are involved in the preprocessing steps without MediaPipe to utilize the image for the Composite VGG16 along with BiLSTM processing, to build a customized sign language dataset using approach 2.

• Start the Program

• Establish Directories

• Path: blank SignImage 224 × 224/A–Z

• Set up the Webcam, select a frame, and draw the ROI (Region of Interest). Extract and Process the Rectangle on the Frame ROI: Make the conversion to grayscale

• Set size 224 × 224.

• Show Real-time Stream & ROI Processed

• Verify the Key Press: A–Z → Save in the appropriate letter folder

• Click ‘1' to save in the ‘blank’ directory.

• Hit Esc → Leave

• Close the window and release the webcam.

Phrase interpretation

Utilizing three pre-trained models to recognize sign language in real time per frame, Keeping track of each model’s letter, number, along with word predictions for each sign, To improve final interpretation and resolve context, the entire video sequence is input to an NLP model at the final stage as shown in Figure 5. Using the BART¹⁷ NLP model, which takes the sequence as input, interprets the context, and then chooses the best prediction for each sign based on the context rather than trying every possible combination because each sign has three predicted tokens that are based on the language and context, as shown in the comparable Table 1. On the other hand, a Pre-trained languages model scores using a brute force method which tries every potential phrase combination without producing text, then determines which sentence sounds the most probable or natural. BERT,¹⁸ a pre-trained language model, is able to give a sentence a likelihood or log-probability. This “score” indicates the likelihood that a human linguist would use this statement in a natural setting. After scoring each combination, choose the one that comes with the highest probability, also known as the lowest loss.

Figure 5.

Phrase prediction flow diagram.

Table 1.

Compare different models for phrase interpretation.

Feature	Transformer (BART)	Seq2Seq (GRU)	Brute force LM (BERT)
Architecture	Transformer (Encoder-decoder)	Recurrent Neural Net (Encoder-decoder)	Bidirectional (Encoder only)
Parallelism (self-attention)	Yes (use self-attention)	No (sequential)	Yes (but, tests all combinations as inputs)
Pre-trained	Yes	Rare (unless use Embeddings)	Yes (Uses pre-trained LM for scoring)
Number of Combinations Tried	1 (best path)	1 (greedy path)	k^n (tries all paths, k signs number and (n = 3) 3 datasets categories)
Real-Time Capable	Yes	Possible (but slower)	Not suitable

Mobile application

Front-end

This Flutter application facilitates seamless navigation, Firebase user authentication, and instantaneous sign language interaction through video input. After an animated SplashScreen, there is a WelcomeScreen with sign-in and sign-up choices. The app’s dynamic and flexible layout allows for safe registration of new members and provides quick accessibility to Home, Categories, Chatting, and Settings through a bottom tab bar. Three key interactive features are offered by the application’s frontend: real-time sign-to-text phrase detection using a language model, augmented reality for text-to-sign conversion in real-time, and e-learning modules to assist with individual signs. Along with an interactive quiz-style game that assesses the user’s ability to recognize signs by choosing the proper interpretation, the e-learning area offers users short, offline films that are organized by sign kinds, allowing them to study at their own entertainment.

Back-end

Firebase’s¹⁹ real-time features, secure user authentication, and smooth integration make it an excellent choice for this Flutter-based signing application. The application manages user registration and login with ease using Firebase Authentication, while Cloud Firestore acts as a powerful NoSQL database to save and sync user data, such as phrase history or sign inputs. Key actions including adding, retrieving, modifying, and deleting records are supported by each user’s unique data collection, which is recognized by their UID. A dynamic, synchronized user experience and effective data management are guaranteed by this configuration.

E-learning phase

To assist users in learning the signs, the program offers E-learning for every sign category through brief offline videos. Additionally, it has a training quiz that tests users’ ability to choose the right response for the sign that is presented.

Augmented reality

The initial step in integrating Augmented Reality (AR)²⁰ with Flutter software is to use a Unity-Flutter plugin, including flutter_unity_widget, to connect Unity’s AR features with Flutter. Start by using Unity’s AR Foundation to produce interactive augmented reality material in order to construct the AR experience. Whenever the Unity project is completed, export it as a platform-specific release (such as an iOS or Android build) or as a Unity package.

Real-time practice phase

Real-time testing is made possible by the application, which lets users enter sign language gestures as whole phrases or by dataset classification (letters, digits, or words). The BART NLP model is then used to construct the final interpretation.

Results

Dataset

The results of applying the different ML approaches in Figure 6 on the gestures English expressions of approach 1. Prediction of English Sign Words as samples are shown in Figure 7.

Figure 6.

Accuracy of applying different ML.

Figure 7.

Prediction of English sign words.

Figure 8 shows the prediction accuracy samples of the alphabetic dataset by the hybrid VGG16 with BiLSTM model and overall train and validation accuracy and loss in Figure 9.

Figure 8.

Prediction of English sign letters.

Figure 9.

BiLSTM & VGG16 train and validation accuracy and loss.

The following involves the BiLSTM and Bidirectional GRU models using MediaPipe for numbers with dataset numbers ranging from 0 to 10 as shown in Figure 10 and the training and validation accuracy and loss are illustrated in Figure 11.

Figure 10.

Prediction of English sign numbers.

Figure 11.

BiLSTM & GRU & mediapipe train and validation accuracy and loss.

Phrase interpretation

In real-time phrase prediction, a prediction is output for each model due to its trained dataset giving three predictions of the nearest word, number and letter as shown in Table 2. Then, the following prediction at each frame is chosen from using the BART algorithm and language rules to avoid the brute-forcing of trying all combinations as illustrated in Table 3.

Table 2.

Example of each sign prediction at separate frames.

Frame	Letter	Number	Word
1	I	1	Hello
2	L	7	Like
3	Y	3	Yes
4	D	4	Dislike
5	G	9	Goodbye

Table 3.

Example of each sign prediction at separate frames.

Frame	BART selected	Why
1	I	Likely start of personal sentence
2	Like	Follows “I” meaningfully
3	You	Inferred by BART from context
4	Skipped	“Dislike”/“4”/“D” break flow
5	Goodbye	Natural closure

Using a test set of 100 signed phrases that included greetings, self-introductions, and basic assertions, we assessed our end-to-end system. The system’s BLEU score of 78.2 indicated that it could not only identify signals but also use the sequence to create cohesive English sentences. This result shows the fact that the NLP module (BART) generates fluent output and successfully corrects for small recognition failures.

Mobile application

Front-end

The user first register for creating his account for future sign in for the proposed mobile application.

Back-end

The sign-up phase sends to the firebase the user information to be saved.

E-learning phase

There are many offline learning sign language categories as shown in Figure 12, each can display an offline video for the sign for learning purposes as in Figure 13. After the user studies the signs he can tackle a quiz for testing his learning through online detection of the presented sign and an overall score is calculated with illustrations of his mistakes and customized advice for selective learning as shown in Figure 14.

Figure 12.

Offline sign language learning.

Figure 13.

Learning signs of days of the week.

Figure 14.

Learning signs quiz and correction feedback.

Augmented reality

Figure 15 shows the use of AR in learning sign language; when any of the pre-saved phrases is recognized using the mobile camera, the related pre-saved sign image is displayed automatically.

Figure 15.

Augmented reality E-learning converting text to sign.

Real-time practice phase

The user can use his real-time camera to predict each sign separately as shown in Figure 16. The system operates on modern smartphones, capturing 1080p video at 30 FPS to accurately track hand movements, where the sign recognition runs locally for minimal delay.

Figure 16.

Real-time sign language prediction.

Conclusion and future work

A thorough dataset of English sign language is presented in this paper, enabling real-time sign language detection for mobile apps. The research improves the overall efficiency of sign-to-speech translation systems. For real-time applications, the combination of deep learning algorithms with high-resolution data guarantees great accuracy and efficiency. Additionally, the use of this technology may greatly increase accessibility, particularly in educational facilities where listening-impaired people can benefit from enhanced learning resources.

Future research will concentrate on adding more sign languages to the dataset, integrating gesture detection for non-manual cues (such facial expressions), and enhancing the mobile app’s real-time processing capabilities. Multi-modal learning integration might be investigated further to enhance sign comprehension in various circumstances. Additionally, improving the system’s performance on low-end smartphones and tablets will be essential to expanding the technology’s accessibility across a range of geographical areas. Although creating a fully working application that benefits the Deaf population is the ultimate aim, the current study mainly focuses on assessing the system’s technological structure and algorithmic performance. Future study will include extensive user studies with Deaf individuals and thorough usability testing.

Footnotes

Acknowledgments

The author expresses their heartfelt gratitude to the individuals who generously contributed their time and effort to the experiments in this research. Special thanks go to Arwa Mohamed Ayman Mekawy, Retag Mamdouh Masoud, Yassmin Hassan Mosad and Rahik Osman Khalaf Osman from the Faculty of Computer Science and Artificial Intelligence at Pharos University in Alexandria, Egypt. Their invaluable participation and dedication were integral to the practical validation and success of this study.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data is contained within the article.*

References

Wazalwar

Shrawankar

. Interpretation of sign language into English using NLP techniques. J Inf Optim Sci 2017; 38(6): 895–910.

Jin

Omar

Jaward

. A mobile application of American sign language translation via image processing algorithms. In: 2016 IEEE region 10 symposium (TENSYMP), Bali, Indonesia, 09–11 May 2016. IEEE, pp. 104–109.

Abu-Jamie

Abu-Naser

(2022). Classification of sign-language using MobileNet-deep learning.

Alsulaiman

Faisal

Mekhtiche

, et al. Facilitating the communication with deaf people: building a largest Saudi sign language dataset. Journal of King Saud University-Computer and Information Sciences 2023; 35(8): 101642.

Zhao

Zhou

, et al. BEST: BERT pre-training for sign language recognition with coupling tokenization. Proc AAAI Conf Artif Intell 2023; 37(3): 3597–3605.

Duarte

Palaskar

Ventura

, et al. How2sign: a large-scale multimodal dataset for continuous American sign language. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2735–2744.

Cabanillas-Carbonell

Cusi-Ruiz

Prudencio-Galvez

, et al. Mobile application with augmented reality to improve the process of learning sign language. 2022.

Shin

Musa Miah

Hasan

MAM

, et al. Korean sign language recognition using transformer-based deep neural network. Appl Sci 2023; 13(5): 3029.

Miah

ASM

Shin

Al Mehedi Hasan

, et al. Rotation, translation and scale invariant sign word recognition using deep learning. Comput Syst Sci Eng 2023; 44(3): 2521–2536.

10.

Halder

Tayade

. Real-time vernacular sign language recognition using mediapipe and machine learning. International Journal of Research Publication and Reviews 2021; 2: 2582–7421.

11.

Said

Boubaker

Altowaijri

, et al. Adaptive transformer-based deep learning framework for continuous sign language recognition and translation. Mathematics 2025; 13(6): 909.

12.

Wolfe

Braffort

Efthimiou

, et al. Special issue on sign language translation and avatar technology. Univers Access Inf Soc 2025; 24(1): 1–3.

13.

Lugaresi

Tang

Nash

, et al. Mediapipe: a framework for building perception pipelines. arXiv preprint arXiv:1906.08172. 2019.

14.

Touloupis

. Natural language processing and text classification with Bert & BiLSTM model. 2024.

15.

Nosouhian

Khoshouei

(2021) A review of recurrent neural network architecture for sequence learning: comparison between LSTM and GRU.

16.

Mascarenhas

Agarwal

. A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for image classification. In: 2021 International conference on disruptive technologies for multi-disciplinary research and applications (CENTCON), Bengaluru, India, 19–21 November 2021. IEEE, vol 1, pp. 96–99.

17.

Venkataramana

Srividya

Cristin

. Abstractive text summarization using bart. In: 2022 IEEE 2nd Mysore sub section international conference (MysuruCon), Mysuru, India, 16–17 October 2022. IEEE, pp. 1–6.

18.

González-Carvajal

Garrido-Merchán

. Comparing BERT against traditional machine learning text classification. arXiv preprint arXiv:2005.13012. 2020.

19.

Chougale

Yadav

Gaikwad

, et al. Firebase-overview and usage. Int Res J Mod Eng Technol Sci 2021; 3(12): 1178–1183.

20.

Chen

Liu

Cheng

, et al. A review of using augmented reality in education from 2011 to 2016. In: Innovations in smart learning. Innovations in smart learning, 2017, pp. 13–18.