Abstract
The government has established special schools to cater to the needs of children with disabilities, but these children are often segregated rather than given equitable opportunities. Artificial Intelligence has opened new ways to promote special education with advanced learning tools. These tools make it possible to adapt a typical classroom setup for all students, with or without disabilities. To ensure social equity and the same classroom experience, a coherent solution is envisioned for inclusive education. This paper proposes a cost-effective and integrated Smart Learning Assistance (SLA) tool for Inclusive Education using Deep Learning and Computer Vision techniques. It comprises speech to text and sign language conversion for hearing impaired students, sign language to text conversion for speech impaired students, and Braille to text conversion for communicating with visually impaired students. The tool helps differently-abled students make use of the teaching-learning opportunities offered to them and ensures convenient two-way communication with the instructor and peers in the classroom, thus making learning easier.
Introduction
The education system plays a crucial role in the overall growth and development of a country. India, one of the fastest developing countries globally, has recognized this and undertaken many reforms to strengthen its education system. It is believed that education is not a privilege but a right of everyone. Students with disabilities may wish to contribute to the development of the country, but they generally do not get equitable opportunities. An analytical report published by enabled.in mentioned that 2.21% (2.68 crore) of the total population (121 crore) has disabilities, out of which 19% have a visual impairment, 19% have a hearing impairment, and another 8% have multiple disabilities [1]. Over the last two decades, the Indian government has implemented several policies concerning educational reforms for the disabled. One of these policies is “Sarva Shiksha Abhiyan” (SSA), which aims at free education for disabled children of 6–14 years of age. Free inclusive education is an integral part of SSA to provide “education to all”. Inclusive education is a learning environment where children with and without disabilities study in the same classroom [2]. The purpose of inclusive education is to make classrooms impartial to all students regardless of any disparity. Inclusive education is a birthright of every child and not a privilege. The Individuals with Disabilities Education Act (IDEA) clearly states that all children with disabilities should be educated with non-disabled peers and have access to the general education curriculum. This act recognizes the need for a comprehensive environment in educational institutes that is sensitive to the needs of children with disabilities. Most schools still do not practice inclusive education and are not sympathetic to the disabled [3]. A major barrier could be the unavailability and unaffordability of the special equipment needed to implement inclusive classrooms.
Being sensitive to the needs of the disabled will not only improve the education system but also make them a better-included section of society, where they can realize their true potential.
Several items, pieces of equipment, and product systems, known as Assistive Technologies (ATs), are available to increase, maintain, and improve the functional capabilities of individuals with disabilities. These technologies have helped children with disabilities to pursue their education. The assistive product industry is currently limited and specialized, primarily serving high-income markets. There is a lack of state funding, nationwide service delivery systems, user-centered research and development, procurement systems, quality and safety standards, and context-appropriate product design for implementing assistive products and technologies. Some of the popular ATs and other commercially available devices that use text to speech (TTS) systems and Optical Character Recognition (OCR) are listed in Table 1, along with their approximate prevailing costs [4]. The high cost of the commercially available ATs, as can be seen from Table 1, discourages their use among a large section of society. They are beyond the reach of the majority of students with special needs.
Available AT devices and their costs.
An important aspect of realizing an inclusive classroom is to provide an atmosphere where students are given equitable opportunities to express themselves freely. Students with disabilities usually take a while to adjust to their peers and often require a different set of costly aids and tools to participate more fully. The teacher also needs to adopt distinctive teaching methods [6] for an effective and inclusive classroom.
To facilitate teaching and learning in an inclusive classroom, an economical and consolidated Smart Learning Assistance (SLA) tool is proposed in this work with the following characteristics. The tool encompasses four modules specific to the instructor and to speech-impaired, visually-impaired, and hearing-impaired students. The procedures integrated into the tool include Sign Language to English text, speech to English text, speech to Sign Language, and Braille to English text conversion, making classrooms more inclusive for students with speech, hearing, and visual impairments. The tool relies on free and open-source software such as Google Colaboratory and Jupyter Notebook, together with Computer Vision techniques such as Convolutional Neural Networks and image processing, which keeps it economical and within the reach of a common student. The tool bridges the communication gap between instructors and students with disabilities, thus removing the need for special classrooms.
The paper is organized as follows: Section 2 presents the related work reported in the literature. Section 3 describes the research methodology, while Section 4 presents the validation of the proposed tool. Finally, Section 5 concludes the paper.
Assistive technologies (ATs) were first introduced in 1974 to aid individuals with visual, speech, and hearing impairments [7]. Since their inception, they have served their purpose reasonably well, but they had the significant drawback of being extremely expensive. However, developments in Artificial Intelligence techniques have made it possible to implement ATs in less complicated, moderately priced devices such as Talking Calculators.
One of the best-known devices for students with hearing impairment is the I-communicator [8]. It is an application developed by Interactive Solutions Inc. (ISI) for the translation of speech to sign language. It can also be used as a dictionary to search for definitions [8]. Another useful device for students with hearing impairment employs bone conduction. It conducts sound to the inner ear primarily through the bones of the skull, allowing the listener to perceive audio content without blocking the ear canal [9]. It is an alternative to a regular hearing aid for those with problems in their outer or middle ears, and it is useful for conductive and mixed hearing losses. A bone conduction hearing device relies on a working cochlea, the hearing organ in the inner ear, to send sound to the brain [10]. These devices may provide fluent two-way communication between the student with hearing impairment and the instructor in a classroom.
For students with speech impairment, Google AI Labs has developed an algorithm capable of tracking the movement of the user’s hands with the camera on a mobile device. However, the algorithm requires a considerable amount of resources for processing data and thus has a long response time. To reduce the resource requirements, movement is detected from the position and size of the palm, which are more distinct and regular, rather than from the entire hand [11]. This way, speech impaired students can communicate easily without any additional hardware.
To address speech disability, a group of students at the University of Washington invented a technology based on Artificial Intelligence known as SignAloud [12]. It consists of a pair of gloves that translate American Sign Language (ASL) into English. The gloves have sensors that track the user’s hand movements and send the data to a computer system via Bluetooth. The computer system analyses the data and matches it to English words, which are then spoken aloud by a digital voice [12]. Students can use these gloves to communicate through sign language gestures, and the instructor can understand them through the digital voice.
As visually impaired students are comfortable using Braille to read and write, Braille translators and screen readers like the SMART Brailler [13] have proved to be very beneficial for them. A Braille translator is a software program that translates a script into Braille cells and sends it to a Braille embosser, which produces a hard copy of the original text in Braille script. The Braille translator makes it convenient for the instructor and the visually impaired student to communicate, as the instructor reads the message in English and the student conveys their message using Braille. The SMART Brailler is an electronic device that has a video screen and provides audio feedback. It displays and speaks letters and words as a student brailles them, which provides instant feedback and allows the student to work independently. The video screen on the device is designed to allow parents of children with visual impairments, and teachers who educate such students in inclusive classrooms, to see in Roman letters what students are writing in Braille.
The solutions mentioned above are devised to assist differently-abled people, each targeting a specific type of disability. Moreover, these solutions are not affordable to a lower stratum of society. Therefore, a cost-effective and integrated Smart Learning Assistance (SLA) tool is envisioned for students with different disabilities so that they could share the same classroom with regular students.
Research methodology
The objective of the SLA tool is to help students with various impairments increase their engagement in the class. The tool bridges the communication gap between the instructor and disabled students so that they can interact seamlessly with their classmates as well as the instructor. The proposed tool is intended for different types of specially-abled students, viz. visually impaired, hearing impaired, and speech impaired students, and for the instructors. The detailed working of the tool is presented in Fig. 1.

Working of the SLA Tool.
For the tool to function as intended, the instructor should use a microphone throughout the lectures to limit background noise, and the webcam should be enabled on the computer system or digital device on which the SLA tool is installed.
The interface of the SLA tool is designed using HTML 5, CSS, and JavaScript, and the tool is implemented using Python 3.7.3. The Graphical User Interface (GUI) of the tool is shown in Fig. 2. The tool was tested on both Windows 10 and macOS Catalina. It supports both English and Hindi.

GUI of the SLA Tool.
Each of the three categories of differently-abled students, as well as the instructor, uses this tool differently. Hence, separate modules or modes of operation are available to address their distinct needs. The tool provides four modes: the Instructor module (Mode 0), the Hearing-impairment module (Mode 1), the Speech-impairment module (Mode 2), and the Visual-impairment module (Mode 3). The tool can be operated through voice commands, which increases its accessibility and responsiveness. The voice commands are primarily used by visually impaired students. Users can choose the appropriate module or mode as per their requirements. The function of each of these modules or modes is explained in the following subsections.
In this mode, the instructor shares his/her screen with the students by clicking the “Share Screen” button in the tool. The screen is visible to all the students connected to the local area network. The instructor then chooses whether his/her audio input is output as text or as sign language gestures, taking the students’ disabilities into account.
Speech to text conversion
In this mode, if the instructor chooses text as the output format, the instructor’s audio input is converted into English text using the “Speech Recognition” API (Application Programming Interface) and the PyAudio library of Python; this requires an active internet connection. It can also be used on single-board computers such as Raspberry Pi with an external microphone.
After selecting this option, the instructor starts the lecture and his/her voice is captured by the microphone. The input audio is captured through the PyAudio library, converted into English text, and made available on the instructor’s screen or device. As the instructor’s screen is shared with the students, they can view the transcribed speech on their devices. This enables the speech and hearing-impaired students to follow the instructions.
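The capture-and-transcribe loop described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the third-party SpeechRecognition package with its Google Web Speech backend (`recognize_google`), and the helper names, the `en-IN` language code, and the 8-second phrase limit are illustrative choices.

```python
def transcribe_phrase(recognizer, source, language="en-IN", max_seconds=8):
    """Record one spoken phrase from `source` and return its transcription.

    `recognizer` is expected to behave like speech_recognition.Recognizer.
    """
    # Stop recording after max_seconds so long sentences are cut into chunks.
    audio = recognizer.listen(source, phrase_time_limit=max_seconds)
    # Assumed backend: Google Web Speech, which needs an internet connection.
    return recognizer.recognize_google(audio, language=language)


def run_live_captioning(language="en-IN"):
    """Print transcribed phrases until the speech service is unreachable."""
    # Third-party import kept local; microphone access also needs PyAudio.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate for room noise
        while True:
            try:
                print(transcribe_phrase(recognizer, source, language))
            except sr.UnknownValueError:
                print("[could not understand the audio]")
            except sr.RequestError as err:
                print(f"[speech service unavailable: {err}]")
                break
```

Calling `run_live_captioning("hi-IN")` would request Hindi transcription instead, assuming the backend supports that language code.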
Speech to sign language conversion
If the instructor selects Sign Language gestures as the output format, the instructor’s audio input is converted into Sign Language gestures shown as images or GIFs (Graphics Interchange Format). Sign Language GIFs of some general conversation dialogues are included in the database. Once the teacher’s speech has been captured by the microphone and transcribed, the tool searches the database for a GIF matching the transcription. If the Sign Language GIF for that dialogue exists in the database, it is displayed in a new window using the Tkinter package. Tkinter is the standard Python library for creating GUI applications. If no match is found in the database, the transcription of the instructor’s speech is split into letters, which are converted into the corresponding Sign Language gestures as shown in Fig. 3. The database contains the gestures of the letters in the form of images. These images are displayed in sequence on the screens of the instructor and the students, following the spelling of the words spoken by the instructor, and help the hearing impaired students follow the instructions.

Conversion of a spoken word “here” into its Sign Language gestures.
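The lookup-then-fingerspell logic described above can be sketched in a few lines. The file layout (`signs/phrases`, `signs/letters`), the file names, and the function names are hypothetical; the paper does not describe how its database is stored.

```python
from pathlib import Path

# Hypothetical database layout: one GIF per known phrase, one image per letter.
PHRASE_GIFS = {
    "hello": Path("signs/phrases/hello.gif"),
    "good morning": Path("signs/phrases/good_morning.gif"),
}
LETTER_DIR = Path("signs/letters")


def normalise(transcript):
    """Lower-case the transcript and collapse repeated whitespace."""
    return " ".join(transcript.lower().split())


def sign_media_for(transcript):
    """Return the media files to display, in order, for a transcript."""
    phrase = normalise(transcript)
    if phrase in PHRASE_GIFS:
        # A whole-phrase GIF exists: show it as a single item.
        return [PHRASE_GIFS[phrase]]
    # Fallback: fingerspell the transcript, one image per ASCII letter.
    return [
        LETTER_DIR / f"{ch}.png"
        for ch in phrase
        if ch.isascii() and ch.isalpha()
    ]


print([p.name for p in sign_media_for("here")])
```

For the word “here”, which is not in the phrase table, the sketch falls back to the four letter images for h, e, r, and e, mirroring the example in Fig. 3.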
In general, students with impaired hearing are proficient in reading the text and if they also lack speech ability, they can use Sign Language as a mode of communication. So, whenever the instructor gives any instruction either by text message or by Sign Language gestures, the instruction is displayed on the shared screen. This mode helps students who are hard of hearing to communicate with the instructor.
Speech-impairment module or mode 2
Students with speech impairment can communicate through written text, but they generally prefer to communicate through Sign Language [14]. The tool lets students use Sign Language for communication at their convenience. To start the conversation, students click on the “Start the conversation using sign language” button, after which the movements of their hands and other gestures are captured by the front camera of their device. These gestures are converted to readable text, enabling the instructor to communicate with them. The Sign Language to text conversion is implemented using a Convolutional Neural Network (CNN) model. A database of 2-D images is created for training the model, and the process is described as follows.
Database creation
A directory is created for each category of sign gesture for training the CNN model. Several images are taken for each of the 26 alphabet gestures, the 10 digits, and some common phrases to achieve high accuracy in sign recognition. For example, to train the CNN model on the Sign Language gesture for the letter “A”, the directory contains 100 images of that gesture. All the sign gestures are based on the Indian Sign Language convention [15].
Image acquisition
The front camera of a device is used to capture the Sign Language gestures made by the speech impaired student. The OpenCV library is used for accessing the front camera and for image processing. Images are captured in different lighting conditions and at different angles against a plain white/light-colored background. The distance between the gesture and the camera is adjusted to get the desired image clarity. These images are pre-processed before being stored in the training and testing databases.
Image pre-processing
The image taken from the front camera of the device is converted from RGB to grey-scale. Background subtraction is then performed to separate the foreground gesture from the background. The resulting black and white images, called Image Frames, are used to train the CNN model.
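The two pre-processing steps can be illustrated on a tiny image. The paper uses OpenCV; the sketch below uses plain Python lists so that it is self-contained. The luminance weights are the standard ones for RGB to grey conversion, while the frame-differencing approach and the threshold of 30 are assumptions, since the paper does not specify its background subtraction method.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) pixels to rows of 0-255 grey levels."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def subtract_background(frame, background, threshold=30):
    """Return a binary mask: 255 where the frame differs from the background.

    Both inputs are grey-scale images of the same size. The threshold is an
    assumed tuning value, not one taken from the paper.
    """
    return [
        [255 if abs(f - b) > threshold else 0 for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(frame, background)
    ]


WHITE, DARK = (255, 255, 255), (40, 40, 40)
background = to_grayscale([[WHITE, WHITE], [WHITE, WHITE]])
frame = to_grayscale([[WHITE, DARK], [WHITE, WHITE]])  # one "hand" pixel
mask = subtract_background(frame, background)
print(mask)
```

Only the pixel that changed relative to the plain background survives in the mask, which is the black and white Image Frame that would be passed on for training.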
Training the CNN model
The CNN is built as a Sequential model. The training input, consisting of filtered and shrunk images originally stored in 2-D format, is flattened to 1-D before being passed to the dense layers. The model further contains three additional hidden layers to identify latent relationships among the features of the flattened data. The actual image classification happens in the output layer, which is the final layer. Two dropout layers, with 50% and 30% probability of deactivating hidden units respectively, are added to reduce over-fitting. Since this is a multi-class classification over labeled categories, the Softmax activation function is used in the final layer.
The CNN model to convert Sign gestures to English text was implemented using Python in Google Colaboratory with the runtime type set to GPU (Graphics Processing Unit). The code snippet for the same is shown in Fig. 4.
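To make the role of the final layer concrete, the sketch below shows how Softmax turns the raw scores of an output layer into class probabilities and how the predicted label is read off. It is a stand-alone illustration in plain Python; the scores and the three labels are made up, and the function names are not taken from the paper.

```python
import math


def softmax(logits):
    """Map raw output-layer scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up scores for three gesture classes.
label, confidence = predict_label([0.1, 3.0, -1.0], ["A", "B", "C"])
print(label, round(confidence, 3))
```

Because the probabilities sum to one, the largest value can be read directly as the confidence of the model in the predicted gesture class.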

Code snippet of CNN Model Building.
After several experiments with different values of the hyper-parameters, the model was fine-tuned using the optimized parameters given in Table 2. The summary of the resulting CNN model, with its layers and their parameters, is given in Table 3. To validate the classification of data using the CNN model, a random sample of 21 classes consisting of some letters and digits was chosen. Figure 5 depicts the confusion matrix, which compares the true classes of the images with the classes predicted by the trained CNN model. Metrics such as Accuracy and Loss were evaluated on the training and validation data sets. Accuracy is the fraction of images whose predicted class matches the true class, while Loss measures how far the predicted class probabilities are from the true labels.

Confusion Matrix (True labels vs Predicted labels).
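As a small illustration of how a confusion matrix and the Accuracy metric relate, the sketch below builds both from lists of true and predicted labels. The labels are invented for the example and are not the paper's data; the function names are likewise illustrative.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return (classes, matrix) with rows = true class, columns = predicted."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in classes] for t in classes]
    return classes, matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Invented example: one "A" gesture is mistaken for "B".
y_true = ["A", "A", "B", "B", "1"]
y_pred = ["A", "B", "B", "B", "1"]
classes, matrix = confusion_matrix(y_true, y_pred)
print(classes)
print(matrix)
print(accuracy(matrix))
```

Off-diagonal entries show which gestures are confused with each other, which a single accuracy figure cannot reveal.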
Hyper-parameter tuning of the CNN model
Summary of compiled CNN model
The Training Accuracy, Validation Accuracy, Training Loss, and Validation Loss of the trained CNN model were evaluated, and the results are presented in Table 4. The confusion matrix in Fig. 5 visualizes and confirms the results in Table 4. Fig. 6a) and Fig. 6b) show the training and validation accuracy and the training and validation loss, respectively, against the number of epochs. As can be observed from these figures, both the accuracy and the loss curves flatten after 45 epochs, and no major improvement was observed beyond that point. Therefore, to prevent the model from overfitting, the number of epochs was set to 45. Once the model is trained, the image frames of the gestures are converted into readable English text and sent to the instructor’s device. The output text pops up as a message on the instructor’s screen.

a) Training and Validation Accuracy curve b) Training and Validation Loss curve.
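The epoch count above was chosen by inspecting the curves. The same idea can be expressed programmatically as a simple plateau check on the validation loss. The sketch below is an illustration only: the `min_delta` and `patience` values and the loss series are assumptions, not values from the paper.

```python
def plateau_epoch(val_losses, min_delta=1e-3, patience=5):
    """Return the 1-based epoch with the best loss before the curve flattens.

    The search stops once `patience` epochs pass without the loss improving
    by more than `min_delta`. If the curve never flattens, the epoch of the
    best loss seen so far is returned.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return best_epoch


# Invented loss curve that stops improving after the fourth epoch.
losses = [1.0, 0.6, 0.4, 0.3, 0.3, 0.3, 0.3]
print(plateau_epoch(losses, patience=3))
```

Stopping at the plateau keeps training time down and limits the risk of over-fitting that the text mentions.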
Experimental results after the model evaluation
This way, students with speech impairment can communicate with the instructor and express themselves effortlessly during the class.
Visually impaired students are skilled in reading and writing Braille, and with advances in technology, they can even type on any device using Braille keyboards. The students can start the visual-impairment module through the voice commands integrated into the tool. They can enter Braille text into the tool using their Braille keyboards, and this input is converted into English text for communicating with the instructor.
Braille to English text conversion
A dictionary of grade 2 Braille [16] is created to translate the Braille input into English text. Braille signs are mapped to letters, digits, contractions, and punctuation marks. In grade 2 Braille, a cell can represent a shortened form of a word. Many cell combinations are created to represent common words, making this the most popular of the grades of Braille. There are part-word contractions, which often stand in for common suffixes or prefixes, and whole-word contractions, in which a single cell represents a complete, commonly used word [17].
Students are instructed to use an escape sequence if they wish to enter any numeric input (0–9). Digits in Braille use the same symbols as the first 10 letters of the alphabet (A–J). For instance, the number “8” and the letter “h” are both represented by “⠓”. An escape code (⠼) is placed before numbers to differentiate them from letters. Therefore “8” is actually “⠼⠓” whereas “h” is only “⠓”.
The translation of Braille input to English text is achieved as follows. The input sentence, consisting of Braille symbols, is divided into words. If a Braille word is a contraction, it is looked up in the dictionary and mapped to the corresponding English text. The remaining Braille words are broken down into their constituent symbols, which are mapped to the corresponding letters and digits. The English text obtained through the grade 2 Braille dictionary then appears on the instructor’s screen. This way, the students can convey their messages to the instructors and their classmates effectively.
A perfect Braille translation can only be done by humans, as it requires an understanding of the content at hand. Moreover, many Braille contractions depend on pronunciation, which makes programming them quite fiddly. For instance, the “gh” contraction that is used in “tough” would not be used in “high.” Hence, the Braille translator used in the proposed SLA tool omits part-word contractions, since they would increase the complexity of the code [18]. In this way, the tool enables the instructors to understand and clarify the doubts of students with visual impairment during the lecture.
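The translation steps above can be sketched as follows. This is a simplified illustration, not the authors' dictionary: it builds the standard letter cells from their dot numbers, includes only six whole-word contractions, treats the number sign as active until a non-digit cell appears, and assumes words are separated by ordinary spaces. All names are hypothetical.

```python
def cell(*dots):
    """Unicode Braille character for the given raised dots (numbered 1-6)."""
    return chr(0x2800 + sum(1 << (d - 1) for d in dots))


# Dot patterns of the letters a-j; k-t add dot 3; u, v, x, y, z add dots 3, 6.
A_TO_J = [(1,), (1, 2), (1, 4), (1, 4, 5), (1, 5),
          (1, 2, 4), (1, 2, 4, 5), (1, 2, 5), (2, 4), (2, 4, 5)]

LETTERS = {}
for i, dots in enumerate(A_TO_J):
    LETTERS[cell(*dots)] = "abcdefghij"[i]
    LETTERS[cell(*dots, 3)] = "klmnopqrst"[i]
for i, dots in enumerate(A_TO_J[:5]):
    LETTERS[cell(*dots, 3, 6)] = "uvxyz"[i]
LETTERS[cell(2, 4, 5, 6)] = "w"

NUMBER_SIGN = cell(3, 4, 5, 6)  # escape code placed before digits
DIGITS = {cell(*dots): "1234567890"[i] for i, dots in enumerate(A_TO_J)}

# A small subset of grade 2 whole-word contractions (single-cell words).
WORDSIGNS = {
    cell(1, 2): "but", cell(1, 4): "can", cell(1, 4, 5): "do",
    cell(1, 2, 5): "have", cell(1, 2, 3): "like", cell(1, 3, 4, 5, 6): "you",
}


def translate_word(word):
    """Translate one Braille word: contraction first, then cell by cell."""
    if word in WORDSIGNS:
        return WORDSIGNS[word]
    out, numeric = [], False
    for ch in word:
        if ch == NUMBER_SIGN:
            numeric = True  # following a-j cells are read as digits
        elif numeric and ch in DIGITS:
            out.append(DIGITS[ch])
        else:
            numeric = False
            out.append(LETTERS.get(ch, "?"))  # "?" marks an unknown cell
    return "".join(out)


def translate(braille_text):
    """Translate a space-separated Braille sentence into English text."""
    return " ".join(translate_word(w) for w in braille_text.split())


print(translate(cell(1, 3, 4, 5, 6) + " " + cell(1, 4) + " " + cell(1, 4, 5)))
print(translate(NUMBER_SIGN + cell(1, 2, 5)))
```

Note that in this sketch a word consisting of a single “h” cell is read as the contraction “have”, whereas the same cell after the number sign is read as the digit 8.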
Validation of the proposed SLA tool
A sample of students with various disabilities, together with teachers, was considered for validating the SLA tool. The details of the sample are given in Table 5. The tool was demonstrated, and responses were collected using the feedback forms given in Appendix 1. The feedback responses were analyzed using Content Analysis, which helped in further improving the proposed tool.
Summary of the profile of the subjects for the research
From the responses received, it was found that the tool provided an effective medium of communication in a real-time classroom, as shown by the pie charts in Fig. 7a). As shown in Fig. 7b), 90% of the sample endorsed that, using the proposed tool, they could participate in the teaching-learning process as actively and confidently as other students.

Feedback Responses for the SLA Tool a) Effective medium of the communication b) Active participation in the teaching-learning.
The tool was presented to four speech-impaired and three hearing-impaired students. The input sign language gestures made by the students, referred to as the Actual Value, were captured by the camera of their device. The images were pre-processed, the RGB images were converted to black and white images, and frames were created for the input gestures. These frames were then recognized by the tool, and the recognized gestures are shown in Table 6. The accuracy of Sign Language to text conversion depends on the lighting conditions, so there should be sufficient light for the best results. The background should be plain; a white/light-colored background was used for capturing the Sign Language gestures. There was no delay between input and output, which allowed the speech and hearing-impaired students to communicate in real time. The output English text corresponding to the Sign Language gestures was displayed as soon as the camera started capturing them.
Sign Language to Text conversion
The speech and hearing impaired students prefer to converse using sign language. To check the efficacy of the tool, the audio inputs of the instructor consisting of common words like Hello, Start, Okay, etc. were given to the SLA tool. The translated Sign gesture images by the SLA tool as given in Table 7 were presented to three Hearing and four Speech Impaired students. It was observed that audio clips were converted to their Sign Language representation accurately if the speaker had clarity in speech. As the GIFs of only a limited number of sentences or syllables were included in the database, this mode of communication could only be used for general conversations. The database can be expanded by including more useful sentences in the future, as per the requirements.
Speech to Sign Language gesture conversion
The speech-to-text conversion of the Instructor module was tested by three hearing impaired students on a few sentences, as shown in Figs. 8 and 9. The text received by the students corresponding to the audio inputs of the instructor is shown in Table 8. It was observed that the proposed SLA tool could successfully detect the speech and convert it into English text. However, for accurate speech detection and conversion, the instructor was requested to speak loudly and clearly. Background noise was avoided to get accurate results. The resulting English text was displayed on the screen after a delay of 3 to 4 seconds.

Snapshot of the conversion from Speech to Text.

Audio input to Text output conversion.
Speech to Text conversion
For best results, the instructor should pause after every 5 to 8 seconds of speaking.
Students with visual impairment prefer to use Braille for conversation. The Braille input from one visually impaired student was therefore given to the SLA tool, and the converted English text received from the tool is shown in Fig. 10. The instructor could read the resulting text, which enabled him to communicate with the students.

Braille to Text conversion.
A variety of assistive devices are available in the market to address the needs of disabled students, e.g. I-communicator [8], SMART Brailler [13], and SignAloud [12]. However, these devices are very expensive and may not be within the reach of a common student. Also, each of these devices addresses a specific kind of disability.
A navigation system, V-eye [19], was proposed in our previous work for visually impaired persons. The SLA tool proposed in this work is a software application using Deep Learning, Computer Vision, and Python. The techniques used to develop this tool are available as open source, which makes the tool affordable to a common student. Besides this, it allows all students, with or without disabilities, to share the same classroom setup.
If a student is speech-disabled but can hear the instructor’s voice, the tool translates his/her sign language to English text that can be understood by the instructor, and thus allows seamless communication in the regular classroom.
A visually impaired student can hear the lecture and understand. But for communicating with the instructor, this tool translates the Braille input of the student to English text for two-way communication.
For a hearing-impaired student, this tool converts the speech of the instructor to sign language and vice versa and thus allows students to interact with the instructor.
The tool addresses a particular disability by acting as a specific translator between the instructor and a disabled student. Even if a student has multiple disabilities, the tool ensures effective and smooth two-way communication between the student and the instructor.
Conclusions
Recent advancements in Artificial Intelligence technologies have transformed the way Inclusive Education is being realized nowadays. The proposed tool provides a means to bridge the communication gap which is usually felt by the students with speech, hearing, and visual impairments in a regular classroom set-up. It may serve as one of the best solutions to accomplish seamless two-way communication of disabled students with the instructor, thus eliminating the need for special classrooms for them. As the existing open-source technologies are used to design and implement the SLA tool, it is also affordable for a common student. The user acceptance testing of the tool was performed with the differently-abled students and the feedback obtained from them through discussions was utilized to improve its performance.
This paper presented an SLA tool built with various Python packages and libraries such as PyAudio, Speech Recognition, Tkinter, and OpenCV, exploiting deep learning and computer vision techniques. To implement the Sign Language to text conversion, a Sequential CNN model was trained and utilized. The validation accuracy of the trained CNN was 99.09%. Moreover, the Hearing-Impaired and the Visually-Impaired modules provided correct results almost all the time, while the Speech-impaired module recognized the Sign Language input gestures correctly 45 times out of 50.
However, for Sign Language to text conversion, the CNN model was trained only on 26 letters, ten digits, and 4-5 phrases. The training process took a few seconds, but if the tool needs to be trained on the whole English vocabulary, the training process may require substantial computational power, time, and storage, which will necessitate the use of high-end systems for the execution of the tool.
Furthermore, with the use of more advanced technologies, the functionality of the proposed tool can be extended to improve its accuracy, usability, reliability, and comfort. The datasets on which the tool has been trained can be improved by adding more relevant data to achieve a dedicated system for differently-abled students. The tool can also serve as a strong base for developing more sophisticated and customized systems for an inclusive classroom. The SLA tool can be integrated with the devices generally used by specially-abled students to enhance their productivity and responsiveness.
Acknowledgments
The authors would like to thank visually, speech, and hearing-impaired students for their feedback and evaluation through responses and discussions to improve the proposed tool. We express our special thanks to Dr. Manoj Garg, Convener, Equal-Opportunity Cell, and Ms. Rupali Pabreja, Nodal Officer, Persons with Disabilities of Acharya Narendra Dev College, University of Delhi, Delhi for providing the necessary support to carry out the research.
Declaration
The authors declare that all procedures performed in the study involving disabled as well as other participants were as per the ethical standards of the institutional Research Review Committee. Informed consent was obtained from all individual participants involved in the study.
Appendix
Participant Feedback Form for taking part in the research
