Abstract
Dysarthria, a motor speech disorder characterized by slurred and often unintelligible speech, presents substantial challenges for effective communication. Conventional automatic speech recognition systems frequently underperform on dysarthric speech, particularly in severe cases. To address this gap, we introduce low-latency acoustic transcription and textual encoding (LATTE), an advanced framework designed for real-time dysarthric speech recognition. LATTE integrates preprocessing, acoustic processing, and transcription mapping into a unified pipeline, with its core powered by a hybrid architecture that combines convolutional layers for acoustic feature extraction with bidirectional temporal layers for modeling temporal dependencies. Evaluated on the UA-Speech dataset, LATTE achieves a word error rate of 12.5%, phoneme error rate of 8.3%, and a character error rate of 1%. By enabling accurate, low-latency transcription of impaired speech, LATTE provides a robust foundation for enhancing communication and accessibility in both digital applications and real-time interactive environments.
Introduction
Dysarthria is a complex motor speech disorder resulting from impaired neuromuscular control, characterized by difficulties in articulation, phonation, and prosody. This condition manifests as slurred, slow, or monotone speech, significantly compromising intelligibility and expressiveness.1–6 Dysarthria often arises from neurological conditions such as stroke, Parkinson’s disease, and amyotrophic lateral sclerosis, leading to varied severity levels that can profoundly affect social interactions, emotional well-being, and overall quality of life. Individuals with dysarthria frequently encounter barriers to effective communication, resulting in frustration and diminished self-esteem, and thereby limiting their participation in educational and occupational settings. 7 Conventional automatic speech recognition (ASR) systems, including widely used technologies such as Google Speech-to-Text, excel at processing fluent, typical speech but struggle to accurately recognize dysarthric utterances. The irregularities associated with dysarthric speech, such as imprecise phoneme production, variable speech tempo, and inconsistent vocal quality, complicate the conversion of speech to text. Effective recognition is essential not only for facilitating direct communication but also for enabling applications that require transforming dysarthric speech into comprehensible text, thus enhancing interaction with digital devices and services. 8
Despite advancements in speech technology, there remains a considerable gap in developing ASR systems tailored for dysarthric speech. Previous research has primarily focused on conventional acoustic modeling approaches, which fail to accommodate the unique phonetic and acoustic challenges presented by dysarthria. Addressing these complexities is crucial for creating specialized ASR solutions capable of accurately decoding dysarthric utterances. Recent progress in neural networks, deep learning, and natural language processing has significantly advanced medical science, with successful applications in disease diagnosis, medical imaging, and neurological disorder detection.9–13 Inspired by these achievements,14–19 this study leverages state-of-the-art deep learning architectures to address the pressing challenge of dysarthric speech recognition, aiming to bring similar breakthroughs to the domain of impaired speech processing.
This study introduces an advanced low-latency acoustic transcription and textual encoding (LATTE) framework for dysarthric speech recognition. The framework integrates convolutional layers to extract salient spatial representations from complex acoustic patterns and employs bidirectional temporal layers to capture dependencies in both forward and backward directions. By combining these processes within a streamlined recognition pipeline, the LATTE framework effectively addresses the irregular rhythm, variable speech rates, and unpredictable pauses associated with dysarthria, while ensuring computational efficiency suitable for real-time applications. Ultimately, this research seeks to enhance the usability of speech recognition technologies for individuals affected by dysarthria, contributing to improved communication accessibility, inclusivity, and overall quality of life.
While prior studies have investigated transformer-based architectures and self-supervised frameworks for impaired or dysarthric speech, they often suffer from high latency and computational overhead, making them unsuitable for real-time deployment. Moreover, limited exploration has been directed toward low-latency architectures that balance speed, efficiency, and accuracy for spontaneous dysarthric speech. The proposed LATTE framework advances beyond prior work by introducing a lightweight, latency-optimized hybrid architecture that achieves faster inference with minimal accuracy degradation. Unlike earlier transformer or conventional models that emphasize accuracy at the cost of processing time, LATTE incorporates streamlined convolutional front-end layers and efficient temporal encoders, achieving real-time responsiveness without sacrificing recognition precision. This design uniquely positions LATTE as a bridge between clinical-grade speech recognition accuracy and practical, assistive technology usability, broadening its relevance across both health care and human–computer interaction domains.
The remainder of this article is organized as follows: The “Related Work” section reviews the related work in dysarthric speech recognition. The “Mathematical Representation of LATTE Framework” section presents the mathematical modeling of the proposed framework. The “Dataset and Participant Characteristics” section describes the dataset and participant characteristics. The “Methodology: LATTE Framework” section details the proposed LATTE framework. The “Evaluation and Results” section presents the evaluation criteria and the results. The “Ethical Considerations and Societal Impact” section discusses ethical considerations and societal impact. The “Error Analysis” section describes an error analysis, and the “Conclusion and Future Work” section presents the conclusion and future directions.
Related Work
Dysarthric speech recognition has gained significant attention in recent years due to the growing need for accessible and efficient communication technologies for individuals with speech impairments. The inherent challenges of dysarthric speech, such as high variability, reduced intelligibility, and limited corpus availability, have motivated the adoption of advanced machine learning and deep learning architectures. Recent advancements in low-latency ASR and cross-lingual speech-to-text modeling have also contributed to improving real-time and multilingual speech systems, offering valuable insights applicable to impaired speech recognition tasks. For instance, the authors of 20 proposed a low-latency ASR framework based on dynamic streaming transformers optimized for real-time inference, while those of 21 introduced a cross-lingual self-supervised speech-to-text model enabling robust multilingual adaptation. These developments highlight the broader research direction toward efficient, adaptive, and inclusive ASR systems. Consequently, hybrid deep architectures, transformer-based models, multimodal fusion techniques, and self-supervised learning (SSL) paradigms have been increasingly explored to enhance both recognition accuracy and computational efficiency. This section reviews state-of-the-art contributions and highlights methodological advancements across these dimensions, concluding with a structured analysis table that consolidates the key techniques and outcomes reported in the literature.
Geng et al. 22 explored data augmentation techniques, namely vocal tract length perturbation and tempo and speed perturbation, applied to both normal and disordered speech. Using the UA-Speech corpus, their approach achieved a word error rate (WER) of 26.37%. Similarly, Celin et al. 23 proposed a two-level augmentation strategy using a virtual linear microphone array-based synthesis followed by multiresolution feature extraction. Their hybrid DNN–HMM system trained on UA-Speech and Tamil dysarthric speech corpora reported a WER of 35.75%.
Chandrakala et al. 24 introduced a histogram of states-based approach within a Deep Neural Network-Hidden Markov Model (DNN–HMM) framework, enabling compact and discriminative embeddings of dysarthric utterances. Their method reported WERs of 53.12%, 73.74%, and 43.15% across subsets of the UA-Speech corpus. Yu et al. 25 proposed the MAV-HuBERT (Mean Absolute Value–Hidden-Unit BERT) framework, which integrates both audio and visual modalities, incorporating a convolutional neural network (CNN)-based facial feature encoding in the first stage and AV-HuBERT pretraining in the second stage. This approach addressed model overfitting issues and achieved a WER of 63.98% on UA-Speech. Similarly, Xiong et al. 26 incorporated articulatory-based representations alongside acoustic features using LSTM-RNNs, reporting a WER of 48.36%.
Recent transformer-based approaches have further advanced the field. Mahum et al. 27 employed a Swin transformer for spectrogram-based spatial feature extraction, reaching a WER of 16.4%. He et al. 28 developed a dual-stream transformer integrating convolutional encoders and TDNN, achieving 17.3%. Yue et al. 29 proposed a multimodal transformer that combined audio and visual features, reporting 18.2%. Peng et al. 30 designed a phoneme-aware conformer, improving temporal alignment with a WER of 14.9%. Shahamiri et al. 31 incorporated context-awareness into their transformer, yielding a WER of 15.1%. Hu et al. 32 investigated multilingual Wav2Vec2.0 pretraining but faced generalization challenges, reporting 17.6%. Mehmood et al. 33 utilized an SSL–CTC framework, with a WER of 19.3%. Finally, Shujie et al. 34 integrated SSL into TDNN–conformer fusion with Wav2Vec2.0-based rescoring, achieving an improved WER of 18.17% on UA-Speech.
Table 1 summarizes these studies, highlighting the techniques, corpora, and reported recognition performance.
Comparative analysis of related work in dysarthric speech recognition
BiLSTM, bidirectional long short-term memory; CNN, convolutional neural network; LATTE, low-latency acoustic transcription and textual encoding; SSL, self-supervised learning; VTLP, vocal tract length perturbation; WER, word error rate; W2V2, Wav2Vec 2.0.
Bold value highlights the result of the proposed framework.
Mathematical Representation of LATTE Framework
In this study, the task of dysarthric speech recognition is formulated as the mapping of impaired speech signals into their most probable textual encodings under a low-latency constraint. The proposed framework, termed LATTE, is designed to integrate acoustic modeling, temporal mapping, and recognition into a unified hybrid structure. Given an input acoustic sequence, the objective is to recover the corresponding sequence of text tokens.
Formally, dysarthric speech recognition in LATTE can be expressed as a conditional probability maximization:
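A standard way to write this objective is sketched below. The symbols are assumptions introduced here for illustration: X = (x_1, ..., x_T) denotes the sequence of acoustic feature frames, Y a candidate token sequence, and V the vocabulary. The second expression applies only if the per-step outputs are treated as conditionally independent given X, which is consistent with a model that emits one softmax distribution per time step.

```latex
\hat{Y} = \arg\max_{Y} \; P(Y \mid X), \qquad X = (x_1, \dots, x_T)

\hat{y}_t = \arg\max_{v \in \mathcal{V}} \; P(y_t = v \mid X), \qquad t = 1, \dots, L
```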
The framework combines convolutional layers for acoustic feature extraction with bidirectional long short-term memory (BiLSTM) layers for temporal mapping and contextual modeling. The CNN component enhances robustness by extracting salient local spectral–temporal features from dysarthric speech, reducing distortions due to articulation impairments. The BiLSTM component then captures forward and backward temporal dependencies, which are crucial for recognizing speech with irregular rhythm, variable pacing, and inconsistent prosody. The CNN feature extraction process is defined as:
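At the block level, a convolutional feature extractor of this kind is commonly written as follows. This is a sketch under stated assumptions rather than the original definition: W_c and b_c are the convolution weights and biases, the asterisk denotes one-dimensional convolution along the time axis, and the ReLU nonlinearity is assumed, since the activation used in the convolutional layers is not stated.

```latex
F = \mathrm{MaxPool}\big( \mathrm{ReLU}( W_c * X + b_c ) \big), \qquad
H = \mathrm{BiLSTM}(F)
```

Here F is the pooled feature map passed to the temporal layers and H is the resulting sequence of bidirectional hidden states.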
Dataset and Participant Characteristics
The availability of publicly accessible speech corpora containing dysarthric speech samples is limited. Notable examples include Nemours, 35 TORGO, 36 and UA-Speech. 37 Additionally, Google has developed the Euphonia dataset, 38 which includes dysarthric samples; however, this dataset was not accessible during the course of this study, and our request for access was not fulfilled. Among the existing datasets, UA-Speech is the largest, encompassing a higher number of dysarthric participants and being the most frequently utilized corpus in dysarthric speech recognition research. 39 Consequently, it serves as the foundation for this study.
The UA-Speech corpus is a widely recognized dataset specifically designed for research in dysarthric speech recognition. This dataset is invaluable for developing and evaluating models aimed at improving speech-to-text transcription for individuals with dysarthria. It encompasses speech samples from speakers of different ages, genders, and levels of dysarthria, thereby providing a broad representation of dysarthric speech patterns. UA-Speech was developed by researchers at the University of Illinois and features speech samples obtained from 19 dysarthric individuals with speech intelligibility levels ranging from 2% to 95%. Speakers are grouped into four intelligibility bands: very low (0%–25%), low (25%–50%), mild (50%–75%), and high (75%–100%). Speech intelligibility, the extent to which speech is comprehensible to a typical listener, is one measure of the severity of dysarthria. The publicly available corpus provides speech samples from 28 speakers, comprising 15 dysarthric speakers and 13 healthy control speakers; the samples of the other four dysarthric participants (speakers M02, M03, F01, M06) are not publicly available. The dataset features both spontaneous and read speech recordings, which are instrumental in capturing the inherent variability and distinctive characteristics of dysarthric speech. Each speech sample is transcribed with phonetic and orthographic annotations, ensuring precise alignment between spoken utterances and their textual counterparts. Such detailed transcription is vital for training effective and accurate speech recognition models. Furthermore, the recordings were made under controlled conditions to minimize background noise and recording artifacts, thus enhancing the overall data quality and reliability.
With thousands of utterances available, the UA-Speech corpus provides a robust foundation for training, validating, and testing various speech recognition systems. Its extensive utilization in diverse research studies focused on speech synthesis, recognition, and pathology underscores its significance as a fundamental resource for advancing dysarthric speech recognition technologies, ultimately facilitating improved communication for individuals affected by dysarthria. The dataset details are summarized in Table 2.
Summary of the UA-Speech corpus and participant characteristics
F, female; M, male; N/A, not available.
Methodology: LATTE Framework
In this section, we introduce the proposed LATTE framework for dysarthric speech recognition. The section discusses the overall pipeline of the framework, including input processing and transcription mapping of the dysarthric corpus, the acoustic preprocessing and feature extraction procedures, and finally the architecture of the proposed framework. Together, these components outline how LATTE systematically processes speech data, aligns it with textual representations, and supports efficient and reliable recognition.
Preprocessing and transcription mapping for dysarthric corpus
This study utilizes the UA-Speech corpus, a benchmark dataset for dysarthric speech recognition. Preprocessing plays a pivotal role in organizing and standardizing the corpus, thereby ensuring accurate feature extraction and reliable model training. Key parameters such as MAX_LEN, VOCAB_SIZE, and NUM_MFCC were defined to control input sequence length, vocabulary coverage, and the dimensionality of the Mel-frequency cepstral coefficients (MFCCs). Directory paths for audio files, Master Label Files (MLFs), lexicons, and speaker wordlists were systematically specified to streamline resource access.
Metadata was extracted from speaker and word-recording Excel sheets, enabling a precise mapping of each utterance to its corresponding audio file. The MLFs were parsed to align utterance-level transcriptions with audio samples, and tokens were normalized to uppercase for consistency. The final output was a structured dictionary, where audio filenames served as keys and transcribed sequences as values, ensuring robust pairing between speech and text.
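The MLF parsing step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the usual HTK-style layout (a `#!MLF!#` header, a quoted label-file pattern per utterance, one token per line, and a lone period closing each entry), and the function name `parse_mlf` and the sample filenames are hypothetical.

```python
from pathlib import PurePosixPath


def parse_mlf(text):
    """Map each utterance identifier to its uppercase transcription.

    Assumes one token per label line with no timing fields.
    """
    mapping = {}
    current, tokens = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the file header
        if line.startswith('"') and line.endswith('"'):
            # A quoted pattern such as "*/name.lab" opens a new entry.
            current = PurePosixPath(line.strip('"')).stem
            tokens = []
        elif line == ".":
            # A lone period closes the current entry.
            if current is not None:
                mapping[current] = " ".join(tokens)
            current = None
        elif current is not None:
            tokens.append(line.upper())  # normalize tokens to uppercase
    return mapping


# Hypothetical sample content; real UA-Speech filenames differ.
sample = '''#!MLF!#
"*/spk01_utt001.lab"
command
.
"*/spk01_utt002.lab"
open
door
.
'''
transcripts = parse_mlf(sample)
print(transcripts)
```

Keying the dictionary on the file stem rather than the full path keeps the lookup independent of where the audio directory is mounted.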
The overall process is illustrated in Figure 1, while the detailed dataset mapping is provided in Table 3. This condensed pipeline establishes a reliable foundation for addressing the challenges of impaired speech recognition.

Condensed preprocessing pipeline for dysarthric speech corpus.
Systematic mapping of speech utterances to audio files and transcriptions
Acoustic preprocessing and feature extraction
Feature extraction transforms raw dysarthric speech into discriminative representations suitable for recognition. In this study, we employ MFCCs, which capture perceptually relevant spectral properties of speech.
The process begins by segmenting the signal into overlapping frames, followed by applying the discrete Fourier transform to obtain frequency spectra. These spectra are filtered on the Mel scale to emphasize human auditory sensitivity, and the logarithm of the filterbank energies is decorrelated using the discrete cosine transform (DCT). The i-th MFCC coefficient is computed as:
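The textbook DCT-II form of this computation is given below as a sketch. The symbols are assumptions introduced here: E_m is the energy in the m-th Mel filter, M is the number of Mel filters, and N_MFCC is the number of retained coefficients (NUM_MFCC in the preprocessing parameters). Scaling conventions differ between toolkits, so the original expression may include a normalization factor.

```latex
c_i = \sum_{m=1}^{M} \log(E_m)\,
      \cos\!\left[ \frac{\pi i}{M} \left( m - \tfrac{1}{2} \right) \right],
\qquad i = 0, 1, \dots, N_{\mathrm{MFCC}} - 1
```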
For each audio sample Xn with sampling rate sr, the MFCC matrix is given by:
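In matrix form, and assuming T_n denotes the number of frames in sample n (notation introduced here, not taken from the original), the per-frame coefficient vectors are stacked row by row:

```latex
\mathbf{M}_n = \mathrm{MFCC}(X_n, sr) \in \mathbb{R}^{T_n \times N_{\mathrm{MFCC}}}
```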
This structured representation enables the model to learn mappings between impaired acoustic patterns and their linguistic transcriptions. Figure 2 illustrates the MFCC extraction pipeline. The raw dysarthric speech signal is first segmented using framing and windowing, followed by a fast Fourier transform (FFT) that converts each frame into its frequency-domain representation. The Mel filter bank then emphasizes perceptually meaningful frequency bands, and logarithmic compression stabilizes amplitude variations. Finally, the DCT decorrelates the compressed filterbank energies to produce compact MFCC feature vectors used as model inputs.
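The five stages (framing and windowing, FFT, Mel filtering, log compression, DCT) can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' code. The 16 kHz rate, 13 coefficients, and 25 ms frames follow the experimental settings; the 10 ms value is interpreted here as the hop between frames, and the Hamming window, 26 Mel filters, and 512-point FFT are assumptions.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb


def mfcc(signal, sr=16000, n_mfcc=13, frame_ms=25, hop_ms=10,
         n_filters=26, n_fft=512):
    """Return a (frames, n_mfcc) matrix; needs at least one full frame."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    # 1) Framing and windowing.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2) FFT power spectrum (frames are zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # 3) Mel filterbank energies, 4) log compression.
    log_e = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 5) DCT-II over the filter axis, keeping the first n_mfcc coefficients.
    m = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (m + 0.5) / n_filters)
    return log_e @ basis.T


# Usage: one second of a synthetic 440 Hz tone stands in for an utterance.
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)
```

In practice a library routine would normally replace this hand-written version; the explicit form is shown only to make each stage of the pipeline visible.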

Layered MFCC extraction pipeline for dysarthric speech. MFCC, Mel-frequency cepstral coefficient.
To prepare the dataset, transcriptions were first loaded into a dictionary mapping each audio identifier to its text. The tokenizer converted text into integer sequences with a vocabulary size defined by VOCAB_SIZE, and sequences were padded for uniformity. Input features X were similarly padded, and the dataset was split into training and testing subsets (80:20).
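A minimal sketch of the tokenization, padding, and 80:20 split is given below. It assumes a word-level tokenizer, with id 0 reserved for padding and id 1 for out-of-vocabulary words; the tokenizer granularity is not stated in the text, and all function names are hypothetical.

```python
import numpy as np


def build_vocab(texts, vocab_size):
    """Assign ids by descending frequency; 0 = padding, 1 = out-of-vocabulary."""
    counts = {}
    for text in texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 2 for i, w in enumerate(ordered[: vocab_size - 2])}


def encode(text, vocab):
    return [vocab.get(word, 1) for word in text.split()]


def pad(sequences, max_len):
    """Right-pad (or truncate) integer sequences to a common length."""
    out = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        seq = seq[:max_len]
        out[i, : len(seq)] = seq
    return out


def split_80_20(n_samples, seed=0):
    """Shuffle sample indices and split them 80:20."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    cut = int(0.8 * n_samples)
    return idx[:cut], idx[cut:]


# Usage with toy transcriptions.
texts = ["OPEN DOOR", "OPEN", "CLOSE DOOR"]
vocab = build_vocab(texts, vocab_size=10)
labels = pad([encode(t, vocab) for t in texts], max_len=25)
train_idx, test_idx = split_80_20(10)
print(labels.shape, len(train_idx), len(test_idx))
```

A purely random split mixes utterances from the same speaker across both subsets; whether the original split was speaker-dependent in this way is not stated.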
Since model outputs operate on fixed time steps, transcription labels were downsampled to match the output resolution. Given a sequence y and target length L, downsampling is defined as:
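One plausible form of this rule is uniform index sampling, in which output position i takes the label at index floor(i * |y| / L). This scheme is an assumption made for illustration and the original definition may differ; the function name is hypothetical.

```python
def downsample_labels(y, target_len):
    """Resample a label sequence to target_len steps by uniform index sampling.

    Sequences shorter than target_len are stretched by repeating labels.
    """
    n = len(y)
    if n == 0 or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    return [y[(i * n) // target_len] for i in range(target_len)]


# Usage: 100 label steps reduced to the 25 steps used in this study.
reduced = downsample_labels(list(range(100)), 25)
print(reduced[:5], len(reduced))
```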
This alignment ensures temporal consistency between labels and predicted outputs, stabilizing model training and evaluation. In our case, all label sequences were standardized to 25 steps. Table 4 summarizes the preprocessing parameters applied.
Summary of preprocessing parameters
LATTE architecture design
This section describes the LATTE architecture utilized for dysarthric speech recognition. The model integrates CNN layers and BiLSTM layers to effectively process and classify feature-extracted speech data. The architecture is designed to leverage the strengths of both CNNs and LSTMs, capturing both local features and temporal dependencies from the input speech sequences.
The input to the CNN–BiLSTM model is a sequence of feature vectors derived from MFCCs. Each input sequence X is represented as:
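With T time steps and d coefficients per frame (symbols assumed here, corresponding to MAX_LEN and NUM_MFCC in the preprocessing parameters), the input can be written as:

```latex
X = [x_1, x_2, \dots, x_T] \in \mathbb{R}^{T \times d}, \qquad x_t \in \mathbb{R}^{d}
```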
In the CNN component, the model employs one-dimensional convolutional layers to extract local patterns from the input sequences. The convolution operation is defined as:
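In index form, a one-dimensional convolution with kernel size K over C input channels is conventionally written as below. The notation is assumed for illustration: f_{t,k} is the response of filter k at time step t, W and b are the filter weights and biases, and the activation is taken to be ReLU, which the text does not state.

```latex
f_{t,k} = \mathrm{ReLU}\!\left( \sum_{j=0}^{K-1} \sum_{c=1}^{C}
          W_{j,c,k}\; x_{t+j,\,c} + b_k \right)
```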
The output from the BiLSTM is concatenated as:
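The standard bidirectional formulation is sketched below, with assumed notation: f_t is the convolutional feature vector at step t and u is the number of LSTM units per direction. With the 128 units per direction given in the experimental settings, each concatenated state has 256 dimensions.

```latex
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{fwd}}\big(f_t, \overrightarrow{h}_{t-1}\big), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{bwd}}\big(f_t, \overleftarrow{h}_{t+1}\big), \qquad
h_t = \big[\, \overrightarrow{h}_t \, ; \, \overleftarrow{h}_t \,\big] \in \mathbb{R}^{2u}
```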
The subsequent layer is a TimeDistributed Dense layer with 64 units and ReLU activation. This layer is applied to each time step to project the concatenated BiLSTM outputs into a more compact 64-dimensional representation while preserving the temporal structure.
The final output is produced by a TimeDistributed Dense layer with softmax activation. This layer outputs a probability distribution over the vocabulary size. The softmax function is defined as:
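For a vector of output scores at time step t, the standard softmax is given below. The symbols are assumed here: z_{t,i} is the score for vocabulary entry i and V is the vocabulary size.

```latex
P(y_t = i \mid X) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{V} \exp(z_{t,j})},
\qquad i = 1, \dots, V
```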
The architecture of the proposed model LATTE is outlined in Table 5, which presents the structure of the network, including the types of layers used, their output shapes, and the number of trainable parameters. This table provides a clear depiction of the model’s overall complexity and computational footprint. The model begins with an input layer that defines the input dimensions but contains no trainable parameters. Following this, a series of convolutional layers (e.g., Conv1D) are employed to extract local features from the input data. These layers contribute a substantial number of parameters—for instance, the first convolutional layer comprises approximately 7744 trainable parameters. To reduce the feature dimensionality and retain the most salient characteristics, pooling layers such as MaxPooling1D are applied, which do not introduce additional parameters. To mitigate overfitting and enhance generalization, dropout layers are interleaved within the architecture; these layers also contain no parameters. The subsequent bidirectional recurrent layers play a crucial role in capturing long-range temporal dependencies in both forward and backward directions. These layers substantially increase the parameter count, with two bidirectional layers containing approximately 263,168 and 394,240 parameters, respectively. Finally, TimeDistributed layers are utilized to apply dense transformations across each temporal step in the sequence, with parameter counts of around 16,448 and 29,575 for the two such layers.
Model architecture overview
This structured architecture enables efficient feature extraction, temporal modeling, and sequential representation learning while maintaining a balance between performance and computational efficiency.
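The parameter counts quoted for Table 5 can be cross-checked with the standard per-layer formulas, as in the sketch below. The layer sizes are inferred from the counts and are not all stated in the text: the final dense layer implies a vocabulary of 455 entries, and the first convolutional layer's 7,744 parameters correspond to a kernel of size 3 over 40 input channels. With the 13-dimensional MFCCs listed in the experimental settings, the same formula would give 2,560.

```python
def conv1d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * in_channels * filters + filters


def bilstm_params(in_dim, units):
    """Two directions, four gates each, with input, recurrent, and bias terms."""
    return 2 * 4 * (units * (in_dim + units) + units)


def dense_params(in_dim, units):
    return in_dim * units + units


# Layer sizes below are inferred from the reported counts (assumptions).
counts = {
    "conv1d_1": conv1d_params(3, 40, 64),
    "bilstm_1": bilstm_params(128, 128),   # input: 128 conv filters
    "bilstm_2": bilstm_params(256, 128),   # input: 2 x 128 from the first BiLSTM
    "td_dense_1": dense_params(256, 64),
    "td_dense_2": dense_params(64, 455),   # 455 = implied vocabulary size
}
for name, value in counts.items():
    print(f"{name}: {value:,}")
```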
Experimental settings
For reproducibility, the proposed LATTE framework was evaluated on the UA-Speech corpus. Audio was resampled to 16 kHz, and 13-dimensional MFCCs were extracted with 25 ms frames and 10 ms overlap. Transcriptions were tokenized, padded to 25 steps, and split 80:20 for training and testing. The LATTE framework used 2 CNN layers (64 and 128 filters, kernel size 3), 2 BiLSTM layers (128 units each), and TimeDistributed Dense layers. The dropout rate was 0.2. Training used the Adam optimizer (learning rate 0.001), sparse categorical cross-entropy loss, batch size 16, and 10 epochs. Experiments were conducted on a workstation with an Intel Core i7 CPU, NVIDIA RTX 3080 GPU, 32 GB RAM, Python 3.11, and TensorFlow 2.14.
Computational efficiency analysis
The computational efficiency of the proposed LATTE framework is comprehensively evaluated in terms of trainable parameters, time complexity, and memory utilization. The LATTE framework has been meticulously designed to ensure real-time operability, low-latency response, and lightweight processing, which are essential characteristics for dysarthric speech recognition in practical communication systems. LATTE integrates two principal processing modules: convolutional layers for acoustic feature extraction and temporal layers for modeling sequential dependencies across speech segments. This design paradigm not only enhances the efficiency of feature representation but also maintains computational scalability across both training and inference phases. The computational complexity of the convolutional block in LATTE can be expressed as:
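A conventional per-utterance cost estimate for a stack of one-dimensional convolutions is sketched below, together with the analogous expression for the bidirectional layers. All symbols are assumptions introduced here: L_c and L_r are the numbers of convolutional and recurrent layers, T_l is the sequence length at layer l, K_l the kernel size, C_{l-1} and C_l the input and output channel counts, and d_l and u_l the input and hidden sizes of recurrent layer l.

```latex
\mathcal{O}_{\mathrm{conv}} = \mathcal{O}\!\left( \sum_{l=1}^{L_c} T_l \, K_l \, C_{l-1} \, C_l \right),
\qquad
\mathcal{O}_{\mathrm{BiLSTM}} = \mathcal{O}\!\left( \sum_{l=1}^{L_r} 2 \cdot 4 \, T \, u_l \,(d_l + u_l) \right)
```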
The overall computational footprint of LATTE, as presented in Table 6, demonstrates a balance between model expressiveness and runtime efficiency. Parameter sharing, dropout regularization, and hierarchical feature compression further minimize redundancy and enhance the system’s capacity for low-resource deployment.
Computational efficiency summary of the low-latency acoustic transcription and textual encoding framework
These findings confirm that LATTE achieves real-time inference with minimal computational overhead, substantiating its suitability for deployment in resource-constrained and latency-sensitive speech recognition environments. The architecture’s streamlined acoustic-to-textual processing pipeline ensures superior speed, scalability, and precision, distinguishing LATTE as a robust framework for real-time dysarthric speech communication.
Evaluation and Results
To evaluate the performance of our proposed dysarthric speech recognition model, we employ several widely used metrics: accuracy, loss, character error rate (CER), WER, phoneme error rate (PER), and F1-score. These metrics together provide a comprehensive view of transcription performance, spanning overall prediction correctness, character- and phoneme-level accuracy, and segment-level balance between precision and recall. LATTE is evaluated on the UA-Speech corpus of English dysarthric speech, which contains diverse speakers and exhibits irregular articulation patterns, making it a suitable benchmark for assessing transcription robustness.
Accuracy measures the proportion of correctly predicted samples among all predictions, while loss quantifies the difference between predicted and true labels. Accuracy and categorical cross-entropy loss are defined as:
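The usual definitions are given below as a sketch, with notation assumed here: N is the number of evaluated time steps, V the vocabulary size, y_{n,i} the one-hot reference label, and \hat{y}_{n,i} the predicted probability. The sparse variant named in the experimental settings computes the same quantity from integer labels.

```latex
\mathrm{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}},
\qquad
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{V} y_{n,i} \, \log \hat{y}_{n,i}
```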
WER, CER, and PER evaluate transcription errors at the word, character, and phoneme levels, respectively. These metrics are especially important in dysarthric speech due to irregular phoneme production and atypical articulation. They are calculated as:
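All three rates share the standard edit-distance form (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions needed to turn the hypothesis into the reference and N is the reference length in the relevant unit. The sketch below is illustrative and not the authors' evaluation script; function names are hypothetical, and PER would apply `error_rate` to phoneme sequences obtained from a pronunciation lexicon.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """(S + D + I) / N over arbitrary units: words, characters, or phonemes."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref, hyp):
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    return error_rate(list(ref), list(hyp))


# Usage with toy strings.
print(wer("OPEN THE DOOR", "OPEN DOOR"))
print(cer("DOOR", "DOER"))
```

Because insertions are counted against the reference length, these rates can exceed 100% when the hypothesis is much longer than the reference.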
F1-score provides a harmonic mean of precision and recall, balancing over- and under-prediction for segment-level recognition:
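In terms of true positives (TP), false positives (FP), and false negatives (FN), the standard definitions are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```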
Training and validation performance
LATTE was trained for 10 epochs. Table 7 summarizes epoch-wise training and validation performance.
Epoch-wise training and validation performance of low-latency acoustic transcription and textual encoding
As observed in Figure 3, training accuracy improves consistently across epochs, reflecting progressive learning of dysarthric speech patterns. Validation accuracy rises in parallel, indicating strong generalization. Concurrently, training and validation losses decrease steadily, demonstrating effective optimization and feature extraction at character, phoneme, and word levels.
The accuracy curves in Figure 3 show that LATTE progressively refines its ability to map complex dysarthric speech to the correct textual output. Similarly, the loss curves confirm effective convergence and reduction of transcription errors. Table 8 presents a detailed evaluation across all metrics.
Error metrics of the proposed low-latency acoustic transcription and textual encoding model
The low CER and PER, as illustrated in Figure 4, highlight LATTE’s ability to accurately capture fine-grained phonetic and character-level patterns. The WER demonstrates effective word-level transcription, even with irregular articulation, while the high F1-score confirms a balance between precision and recall at the segment level.
Ablation study analysis
We conducted a series of experiments on the UA-Speech dataset to evaluate the individual contributions of the CNN and BiLSTM components to dysarthric speech recognition. We first employed a CNN-only model, which achieved a WER of 93.87%. This indicates that although CNNs are effective at capturing local features in speech signals, they do not adequately capture the temporal dependencies inherent in dysarthric speech. Next, we applied a BiLSTM-only model, which achieved a WER of 77.5%. This demonstrates its capability to model temporal dependencies, but the performance remained suboptimal, suggesting that temporal modeling alone is not sufficient for accurate dysarthric speech recognition. Finally, we integrated both components into the proposed LATTE framework. This hybrid approach achieved a WER of 12.5%, substantially outperforming both the CNN-only and BiLSTM-only models. These results underscore the complementary strengths of the two components: the CNN extracts local features, the BiLSTM models temporal dependencies over those features, and their combination in LATTE yields markedly better recognition of dysarthric speech patterns. An overview of the ablation study is presented in Figure 5.
Comparison with baseline architectures
To validate LATTE, we compared it with two recent strong baselines: the multilingual Wav2Vec2.0 model of Hu et al. 32 (2024) and the Swin transformer of Mahum et al. 27 (2025). Table 9 summarizes the comparison.
Comparison of proposed low-latency acoustic transcription and textual encoding with recent baselines
CER, character error rate.
Bold value highlights the result of the proposed framework.
LATTE achieves lower WER and CER than the baselines while requiring roughly half their inference time, demonstrating superior transcription performance and low-latency applicability. Figure 6 illustrates this comparison visually. Overall, LATTE shows strong generalization, consistently low error rates across characters, words, and phonemes, and efficient computation, confirming its suitability for real-time dysarthric speech-to-text transcription.

Training and validation performance of LATTE. Left: accuracy curves; right: loss curves across 10 epochs. The minimal gap indicates strong generalization. LATTE, low-latency acoustic transcription and textual encoding.

Box plot of WER for LATTE on test samples, illustrating low error variance and robust transcription. WER, word error rate.

Component-wise ablation study analysis of the LATTE architecture on dysarthric speech. CER, character error rate; PER, phoneme error rate.

Bar chart comparison of WER and CER for LATTE and recent baselines.
Ethical Considerations and Societal Impact
This research adheres to the highest ethical standards, particularly given the sensitivity of working with dysarthric speech data. All data from the UA-Speech corpus were handled with strict privacy protocols, ensuring that participant identities were anonymized and protected. Proper informed consent was ensured wherever applicable, especially for those with speech impairments, allowing participants to fully understand the purpose and use of their data in developing assistive technologies. Bias in model development was another key consideration. Dysarthric speech varies significantly across individuals, and we aimed to develop a model that could recognize a wide range of speech patterns inclusively. This helps ensure that no specific group of individuals with speech disorders is disproportionately disadvantaged by the system’s performance. Additionally, ethical safeguards are in place to prevent misuse of the technology, particularly in contexts where privacy concerns or harm could arise. The societal impact of this research is significant, particularly in supporting individuals with speech impairments in clinical and social settings. Improved speech-to-text transcription can assist clinicians in diagnosing and monitoring speech disorders more effectively, while also enhancing social integration for individuals with dysarthria by enabling clearer and more accurate communication. This work can substantially reduce the communication barriers faced by dysarthric patients, contributing positively to their quality of life and their broader societal inclusion.
Error Analysis
The recognition of dysarthric speech inherently involves significant acoustic and articulatory variability, especially across different severity levels of impairment. In the LATTE framework, three primary sources of transcriptional errors were observed: (i) variability in articulatory precision, (ii) spectral–temporal distortion due to inconsistent phoneme production, and (iii) dataset imbalance across severity categories. To provide a deeper insight, we further quantified performance variation across dysarthria severity levels, categorized as mild, moderate, and severe based on the intelligibility annotations of the UA-Speech corpus. Table 10 presents the WER, PER, and CER across these levels.
Table 10. Severity-wise error analysis of dysarthric speech recognition using low-latency acoustic transcription and textual encoding (LATTE). PER, phoneme error rate.
As Table 10 shows, the error rate increases progressively with dysarthria severity. LATTE maintains strong performance for mildly and moderately impaired speakers owing to its CNN–BiLSTM hybrid design, in which convolutional layers capture short-term acoustic cues and the BiLSTM units model temporal coarticulation dynamics. However, in severe dysarthria, significant phoneme reduction, elongated articulatory transitions, and reduced formant stability cause more phoneme substitution and deletion errors, resulting in a noticeable rise in WER and PER. The hybrid network is limited in capturing extremely distorted articulatory trajectories where intra-speaker variability dominates.
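To make the severity-wise breakdown concrete, the following minimal sketch shows one way such error rates and their substitution, deletion, and insertion counts could be computed per severity group. It is an illustration under stated assumptions, not the evaluation code used for LATTE: the function names (align_counts, error_rates_by_severity), the (severity, reference, hypothesis) tuple layout, and the example utterances are all hypothetical. Passing a word tokenizer yields WER, a character tokenizer yields CER, and a phoneme tokenizer would yield PER.

```python
from collections import defaultdict


def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimum-edit
    alignment of two token sequences (Levenshtein dynamic programming)."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # delete every reference token
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis token
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, ins)          # exact match, no edit
            else:
                diag = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            delete = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insert = (t + 1, s, d, ins + 1)
            cost[i][j] = min(diag, delete, insert, key=lambda c: c[0])
    return cost[n][m][1:]


def error_rates_by_severity(samples, tokenize=str.split):
    """Aggregate edit counts per severity label.

    samples: iterable of (severity, reference_text, hypothesis_text).
    tokenize: str.split for WER, list for CER (hypothetical convention).
    """
    totals = defaultdict(lambda: [0, 0, 0, 0])  # subs, dels, ins, ref tokens
    for severity, ref_text, hyp_text in samples:
        ref, hyp = tokenize(ref_text), tokenize(hyp_text)
        s, d, i = align_counts(ref, hyp)
        acc = totals[severity]
        acc[0] += s
        acc[1] += d
        acc[2] += i
        acc[3] += len(ref)
    return {
        sev: {
            "sub": s,
            "del": d,
            "ins": i,
            "rate": (s + d + i) / n if n else 0.0,
        }
        for sev, (s, d, i, n) in totals.items()
    }


# Made-up utterances, purely for illustration (not UA-Speech data).
samples = [
    ("mild", "open the door", "open the door"),
    ("severe", "open the door", "open door"),
    ("severe", "call my sister", "call me sister"),
]
wer = error_rates_by_severity(samples)             # word-level
cer = error_rates_by_severity(samples, tokenize=list)  # character-level
```

Pooling edit counts over all utterances in a group before dividing by the total reference length, rather than averaging per-utterance rates, keeps short utterances from dominating the group-level figure.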
Additionally, the scarcity of training samples for severe cases within the UA-Speech dataset introduces class imbalance, affecting model generalization. Background noise and microphone inconsistencies further contribute to temporal misalignment between predicted and reference phoneme boundaries. These observations underline the importance of enhanced preprocessing, intelligibility-based data augmentation, and adaptive acoustic modeling to improve robustness in low-intelligibility scenarios. Future extensions of LATTE will explore intelligibility-aware loss functions, attention-based refinement, and transformer-assisted spectral encoding to minimize such degradation across severity levels.
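One common way to counter the class imbalance described above is to weight each training sample inversely to the frequency of its severity class. The sketch below illustrates that general heuristic only; it is not part of LATTE as described, and the function names, the severity labels, and the label counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(severity_labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    under-represented severity levels contribute more per sample."""
    counts = Counter(severity_labels)
    n_samples = len(severity_labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


def weighted_mean_loss(losses, labels, weights):
    """Average per-sample losses, scaling each by its class weight."""
    total = sum(weights[lab] * loss for loss, lab in zip(losses, labels))
    norm = sum(weights[lab] for lab in labels)
    return total / norm


# Hypothetical label distribution, skewed toward mild speakers.
labels = ["mild"] * 6 + ["moderate"] * 3 + ["severe"] * 1
weights = inverse_frequency_weights(labels)
```

With these weights, every class contributes the same total weight to the loss, so errors on the few severe-class samples are not drowned out by the many mild-class samples.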
Conclusion and Future Work
Dysarthric speech, characterized by slow, slurred, and often indiscernible patterns, presents a significant challenge for ASR systems. The intricacy of dysarthric speech patterns, combined with the sensitive nature of the data, makes it difficult for conventional models to capture and transcribe such speech accurately. Existing research in the field has often reported high WERs, underscoring the limitations of previous approaches. These limitations pose a significant barrier for dysarthric patients, whose ability to communicate through automated systems is hindered by recognition errors. This study proposed the LATTE architecture to address these difficulties in dysarthric speech recognition. Our approach demonstrated a marked improvement in accuracy and a substantial reduction in WER compared with existing models. By focusing on the specific challenges of dysarthric speech, such as its unpredictable variations and slurred articulation, our model produced more reliable speech-to-text transcriptions.
While the results are promising, certain limitations remain. The experiments were conducted on the UA-Speech dataset, which primarily contains English-language recordings; therefore, the generalizability of the proposed model to other languages or datasets with differing phonetic and acoustic structures may require further validation. Moreover, as the model leverages deep architectures combining convolutional and recurrent components, the computational complexity can be significant, particularly for real-time or resource-constrained applications. Future research will explore model compression, cross-linguistic adaptation, and dataset diversification to enhance the scalability and applicability of the LATTE framework across diverse linguistic and computational settings.
The implications of this work extend beyond technical accuracy. Accurate speech transcription for dysarthric patients can significantly enhance their ability to communicate, offering vital support in both personal and clinical settings. By reducing the gap between spoken words and their textual counterparts, this technology can alleviate some of the social and psychological impacts experienced by individuals with dysarthria. Ultimately, our proposed framework contributes to an improved quality of life for dysarthric individuals, fostering greater independence and social inclusion, while also providing clinicians with a more robust tool for supporting their patients. This research underscores the importance of continued innovation in the field of dysarthric speech recognition and highlights the potential for technological advances to have a profound positive impact on society.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.
Authors’ Contributions
Qurat Ul Ain led the overall research and made the primary intellectual contribution to this study. She conceptualized the research problem, designed the methodological framework, carried out data preprocessing and feature extraction, developed and implemented the proposed models, conducted all experimental evaluations, and performed an in-depth analysis of the results. She also drafted the original manuscript and finalized it through multiple revisions in accordance with journal standards. Hammad Afzal supervised the research and provided substantial academic and technical guidance throughout the study. He contributed to refining the research objectives, validating the methodological design, and critically reviewing the experimental setup and performance evaluation strategy. Fazli Subhan supported the preliminary analysis, assisted in reviewing the manuscript, and provided suggestions that improved its clarity and presentation. Mazliham Mohd Suud contributed to the literature review process and assisted in manuscript proofreading and formatting. Youghyun Jung provided high-level academic insight and reviewed the manuscript to ensure technical accuracy, coherence, and alignment with the scope of the journal.
