Abstract
Humanoid robots are being deployed in settings where users speak different languages and expect quick, natural responses. In such settings, speech interaction cannot afford noticeable delays. Most current speech-to-text systems rely on cloud servers, which introduces latency, dependence on reliable connectivity, and failures during real-time use. These shortcomings become especially evident when robots are expected to interact with human users autonomously and continuously. To address these limitations, this project proposes an edge-centric speech-to-text framework tailored to multilingual humanoid robots. Instead of sending audio data to the cloud, the method performs speech processing directly on the robot. The framework comprises lightweight neural models for real-time streaming, an onboard mechanism for real-time identification of the spoken language, and local caching for faster retrieval of repeated or known speech patterns. Together, these components yield faster, more reliable transcription without heavy use of network resources. Because speech is handled locally at the wireless edge, the system substantially reduces communication delays while providing transcription in multiple languages. In experiments, overall response time was more than 60% lower than that of cloud-based systems. The framework also remained robust under fluctuating network bandwidth, packet loss, and background noise. We conclude that edge-based multilingual speech-to-text systems will be important for improving the responsiveness and contextual awareness of humanoid robots. Faster understanding leads to faster reactions, smoother conversations, and interactions that feel more natural, which is a major step toward practical and reliable communication between humans and robots in real-world workplaces.
Keywords
Introduction
Humanoid robots are transitioning from laboratory prototypes to practical assistants in households, health care, education, and service industries. For these robots to function as effective collaborators, real-time comprehension of natural spoken language is crucial. Unlike command-based interfaces, human conversation is spontaneous, diverse, and often multilingual,1,2 which has motivated large-scale multilingual frameworks such as mSLAM. 3 Large-scale multilingual corpora, e.g., Indic-ST, 4 have made it possible to train and evaluate speech models in low-resource languages. Designing systems that can process speech efficiently under such conditions presents multiple challenges. The first challenge is latency. Typical speech-to-text (STT) frameworks rely on cloud-based processing. Cloud services such as Google Speech API and Microsoft Azure provide strong accuracy but incur significant delays because of network transmission and dependence on stable connectivity.5,6 Such delays interrupt conversational flow and degrade both interaction quality7,8 and the perceived intelligence and awareness of the robot. The second challenge is usability: real-world environments are linguistically diverse, and a user may switch languages during a conversation. Most existing approaches require either a separately trained model for each language or a large-scale multilingual network, which is computationally heavy,1,2 making them unsuitable for deployment on resource-constrained wireless edge devices and networks. The third challenge involves wireless network conditions. Humanoid robots commonly communicate over Wi-Fi, LTE, or emerging 5G networks, which introduce bandwidth fluctuations 9 and packet loss. These effects, together with interference-induced latency, compound one another and ultimately undermine real-time responsiveness (Fig. 1).
Speech processing tech.
Recent advances in multilingual models, streaming speech recognition architectures, and edge computing open up promising avenues, but they remain fragmented. Large models such as Whisper and its lightweight adaptations generalize well to challenging languages and noise levels but are too resource-intensive for edge use cases. Likewise, studies of low-latency inference often center on single-language benchmarks in controlled environments, which limits their relevance to realistic humanoid usage. To overcome these drawbacks, this paper proposes a cross-lingual STT framework suitable for edge deployment. It is built on the following components:
Lightweight neural STT architectures (RNN-T and Conformer based) optimized for streaming inference;
Adaptive language identification that tracks the spoken language and eases code-switching from one language to another;
Edge-level caching and quantization that reduce memory overhead and speed up repeated interactions;
Wireless optimization strategies that sustain long-running operation over unstable networks (Fig. 2).
Systematic architectural framework: novel edge-enabled architecture for real-time, multilingual STT in humanoid robots. STT, speech-to-text.

Unlike existing cross-lingual STT frameworks that primarily focus on improving recognition accuracy through large-scale models, the proposed approach emphasizes real-time deployability under edge constraints. The novelty of this work lies in its unified integration of adaptive language identification, caching, and quantization within a single pipeline optimized for humanoid interaction, rather than introducing isolated optimizations evaluated in offline settings.
The major contributions of this work are summarized as follows:
We propose an edge-enabled multilingual STT architecture tailored for humanoid robotic systems, designed to operate under strict latency and resource constraints.
An adaptive language identification mechanism is introduced to support real-time multilingual and code-switching speech scenarios.
Lightweight caching and quantization strategies are integrated to reduce inference latency and memory footprint without compromising recognition performance.
The proposed framework is extensively evaluated under realistic network conditions, including varying bandwidth and noise levels, to demonstrate its practical deployability.
Related Work
Research on multilingual STT for humanoid robots intersects with four active areas: (i) cloud-based STT frameworks, (ii) lightweight low-latency architectures, (iii) multilingual and cross-lingual modeling, and (iv) edge computing for speech processing. In all these fields, partial solutions are offered, but they also reveal limitations that drive this study.
Cloud-based speech-to-text systems
Commercial cloud-based STT services, such as the Google Speech API and Microsoft Azure STT, provide strong accuracy across many languages and have been widely adopted in multilingual applications such as health care and education systems. However, they require stable connectivity and remote processing, which leads to latencies that disrupt real-time interaction. Studies such as OWLS 10 demonstrated that increasing model parameters and training data improves accuracy for both high- and low-resource languages, but such models are unsuitable for time-sensitive applications because of network delay and computational weight. Similarly, fine-tuned variants of Whisper 11 perform well on challenging dialect-heavy data (such as Swiss German), but they rely on heavy models, which restricts on-device usage.
Low-latency neural STT architectures
Recent developments in sequence-to-sequence architectures12,13 such as RNN-T and streaming ASR models, 14 with further improvements achieved through knowledge distillation and model compression techniques, 15 allow streaming inference within a shorter time frame. Recent efforts such as the SM2 model, 12 trained with weakly supervised data for multilingual streaming recognition, have reported competitive zero-shot translation performance. However, these techniques still show limitations regarding latency, for example in performance on open streaming datasets (e.g., random forest network). Although they reduce inference time, most are tested in a single language or under restricted conditions, and cross-lingual adaptability and real-world deployment have not been well investigated.
Cross-lingual and multilingual STT
The study of multilingual corpora, such as the MuAViC dataset,16,17 which combines ∼1,200 hours of audio-visual speech across nine languages, confirms the potential of combining modalities to enhance recognition in noisy contexts.15,18 However, humanoid robots cannot rely on stable visual input, necessitating robust audio-only multilingual STT. In addition, language switching presents special challenges. Prior work19,20 has reported that code-switching is more challenging than previously assumed and that models struggle to recognize rapid language boundaries. These limitations highlight the need for lightweight, precise, and adaptable language identification techniques19,21 for real-time multilingual applications, including zero-resource and adversarial language identification approaches. 22
Edge computing for speech processing
Edge-based inference has recently gained popularity as a technique to alleviate network congestion and latency. Combining quantization and pruning techniques23–25 shows that large STT models can be condensed for resource-constrained applications with minimal accuracy loss. Yet many of these studies focus on single-language cases or single benchmarks, and comprehensive evaluation of cross-lingual STT on wireless edge nodes, especially under unstable conditions such as packet loss or bandwidth variability, is often lacking. Based on this review, we identify three key gaps in the literature:
Latency remains high in cloud-enabled multilingual STT, regardless of transcription accuracy.
Code-switching is under-addressed, with few models capable of smooth, low-latency transitions between languages.
Edge optimization strategies for multilingual STT,26,27 including containerized edge deployment frameworks for speech processing, 28 have seen limited study in the wireless, noisy, and resource-limited environments common in humanoid deployment.
To fill these gaps, this work proposes a framework that includes the following:
Streaming-ready multilingual STT architectures tuned for low latency;
Adaptive language identification for real-time code-switching;
Edge deployment with caching and quantization to achieve speed, accuracy, and resource efficiency;
Real-environment validation of network performance and robustness, which has not been reported in previous studies.
This integration advances the field in terms of accuracy, real-time responsiveness, and edge deployability, combined in one unified package for humanoid robots operating at scale in multilingual contexts.
Positioning of the proposed work and identified research gaps
Recent work on multilingual STT systems can be roughly classified into three directions. The first comprises cloud-based multilingual STT frameworks, which run large models on centralized servers to achieve strong transcription performance across multiple languages. The second comprises low-latency streaming ASR architectures (e.g., RNN-T and Conformer-based models), which reduce inference delay but are evaluated in controlled or single-language environments.1,2 The third comprises edge-optimized ASR solutions, which have so far been studied only in monolingual settings or under naive deployment assumptions. Although some recent survey papers1,2 report improved multilingual accuracy and model scalability, these gains generally relate to training efficiency and dataset expansion. The elements involved in practical deployment, such as real-time switching between languages, wireless instability, and resource constraints at the edge, are often treated separately or excluded altogether. Consequently, the current literature does not offer a common methodology that addresses all of these problems in realistic robotic or humanoid interaction contexts. Based on this analysis, the following research gaps are identified:
Absence of cross-lingual STT frameworks evaluated under wireless edge deployment conditions. Limited support for real-time multilingual code-switching in streaming ASR pipelines. Lack of systematic latency–accuracy–resource trade-off analysis for humanoid and interactive robotic systems.
The proposed framework directly addresses these gaps by integrating adaptive language identification, caching mechanisms,29,30 and model quantization into a unified edge-oriented STT pipeline. Unlike prior studies that introduce isolated optimizations, this work focuses on end-to-end deployability, enabling low-latency, multilingual speech recognition under practical network and hardware constraints. This positioning is consistent with prior work highlighting the need for integrated edge-based multilingual STT systems.7,8
Proposed Methodology and Framework
The proposed framework integrates multilingual STT processing, lightweight neural inference, and wireless edge deployment 26 to achieve real-time humanoid interaction. The methodology unfolds in five phases, as illustrated in the flowchart in Figure 3.

Proposed methodology flowchart.
End-to-end workflow and real-time processing pipeline
Figures 3, 4, and 5 represent the full end-to-end workflow of the proposed edge-based multilingual STT framework. The incoming audio stream is buffered continuously and divided into short frames so that speech can be processed with minimal delay. This framing logic lets the system operate in streaming mode, a requirement for real-time humanoid interaction. A separate language identification module runs in parallel with acoustic feature extraction, which reduces overall processing latency because speech segments are analyzed concurrently rather than sequentially. Once the spoken language is recognized, the system dynamically selects the corresponding lightweight STT model for execution on the edge device. A local cache of previously recognized utterances is also consulted; when a cached match exists, full model inference is bypassed, which speeds up processing. Moreover, quantized inference is used to balance recognition accuracy against computational efficiency, making the architecture suitable for edge hardware. All time-critical stages, including language detection, decoding, and post-processing, are scheduled automatically to meet real-time deadlines. This coordinated execution keeps speech understanding and response generation fast enough to allow seamless interaction between humans and humanoid robots.
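The framing and language-routing steps of this workflow can be illustrated with a minimal Python sketch. It is not the authors' implementation: `stream_frames`, `transcribe_stream`, `toy_langid`, and `toy_models` are hypothetical names, the language detector and per-language models are toy stand-ins, and the loop runs sequentially, whereas the text describes language identification running in parallel with feature extraction.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Frame = Sequence[float]


def stream_frames(samples: Sequence[float], frame_len: int) -> Iterator[Frame]:
    """Yield consecutive fixed-size frames; the final frame may be shorter."""
    for start in range(0, len(samples), frame_len):
        yield samples[start:start + frame_len]


def transcribe_stream(
    samples: Sequence[float],
    frame_len: int,
    detect_language: Callable[[Frame], str],
    models: Dict[str, Callable[[Frame], str]],
) -> List[str]:
    """Route each frame to the model registered for its detected language."""
    outputs = []
    for frame in stream_frames(samples, frame_len):
        lang = detect_language(frame)        # hypothetical LangID call
        outputs.append(models[lang](frame))  # hypothetical per-language STT model
    return outputs


# Toy stand-ins (assumptions for illustration only).
def toy_langid(frame: Frame) -> str:
    return "en" if sum(frame) >= 0 else "hi"


toy_models = {
    "en": lambda f: f"en:{len(f)}",
    "hi": lambda f: f"hi:{len(f)}",
}

result = transcribe_stream([0.1, 0.2, -0.3, -0.4, 0.5], 2, toy_langid, toy_models)
print(result)
```

Keeping the router independent of the concrete models means a per-language model can be swapped for a quantized variant without touching the streaming loop.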

Proposed methodology process flow.

Framework diagram for cross-lingual low-latency STT real-time humanoid interaction. STT, speech-to-text.
Dataset preparation
A multilingual corpus was compiled consisting of English, Hindi, Spanish, and Mandarin speech samples. The dataset includes various accents and dialects to enhance robustness, real-world conditions with background noise and overlapping speech, and augmentation techniques (time stretching, pitch shifting, and noise injection) to facilitate generalization (Table 1).
Summary of datasets used for training and evaluation
All audio samples were resampled to 16 kHz and normalized prior to training. The in-house dataset consists of speech recorded in indoor environments using consumer-grade microphones, covering variations in accent, speaking rate, and background noise.
Preprocessing
Incoming audio was first processed with Voice Activity Detection (VAD), which eliminates silent intervals. Noise filtering then suppressed environmental noise while retaining natural speech features. Next, feature extraction computed Mel-Frequency Cepstral Coefficients (MFCCs) and generated spectrograms as neural input. In the last preprocessing step, streaming segmentation divided the audio stream into small frames, enabling incremental transcription with minimal delay.
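As a rough illustration of the VAD and segmentation steps, the sketch below uses a simple energy threshold. This is an assumption: the paper does not state which VAD algorithm was used, and the names `split_frames`, `rms`, and `drop_silence`, as well as the threshold value, are hypothetical. Noise filtering and MFCC extraction are omitted.

```python
import math
from typing import List, Sequence


def split_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut the signal into frames of frame_len samples, advancing by hop samples."""
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, last_start + 1, hop)]


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def drop_silence(frames: List[List[float]], threshold: float) -> List[List[float]]:
    """Energy-based VAD: keep only frames whose RMS reaches the threshold."""
    return [f for f in frames if rms(f) >= threshold]


# Four silent samples followed by four voiced samples (toy data).
signal = [0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.5, -0.5]
frames = split_frames(signal, frame_len=4, hop=4)
voiced = drop_silence(frames, threshold=0.1)
print(len(frames), len(voiced))
```

In practice the threshold would need tuning per microphone and environment; a trained VAD model is a common alternative to a fixed energy threshold.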
Model design and training
Two optimized neural models are used: an RNN-Transducer (RNN-T) and a Conformer-based Transformer architecture, which capture both local and global dependencies. A dynamic language identification module is integrated to route speech frames automatically to the appropriate language model. Training used transfer learning from large pretrained multilingual STT models, and compression strategies (pruning, quantization, and knowledge distillation)23,25,29 were applied for edge deployability, along with structured pruning and distillation techniques. 16
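To make the quantization step concrete, the following framework-agnostic sketch shows symmetric per-tensor int8 quantization of a weight vector. It is an illustrative assumption rather than the scheme used in the paper, which does not specify bit width or calibration; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.3, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale, which is one way to see why memory drops sharply while accuracy changes only slightly.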
Edge deployment
Edge nodes hosted the optimized models, significantly reducing cloud dependency. The deployment pipeline was designed so that preprocessing → STT inference → humanoid response generation could execute in parallel. Frequently used phrases were cached for fast response in repeated interactions, and adaptive strategies adjusted inference rates dynamically to cope with wireless fluctuations.
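A phrase cache of this kind could be sketched as a small least-recently-used (LRU) store, as below. The eviction policy, the `PhraseCache` class, and the use of a string key standing in for an audio fingerprint are all assumptions; the paper does not describe how cache keys are derived or how entries are evicted.

```python
from collections import OrderedDict
from typing import Callable


class PhraseCache:
    """Bounded LRU cache that also tracks its hit rate."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._store: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[str], str]) -> str:
        """Return the cached transcript, or run the model and cache its result."""
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


model_calls = []


def fake_model(key: str) -> str:
    """Stand-in for full STT inference; records each invocation."""
    model_calls.append(key)
    return key.upper()


cache = PhraseCache(capacity=2)
for utterance in ["hello", "hello", "stop", "go", "hello"]:
    cache.get_or_compute(utterance, fake_model)
print(cache.hit_rate(), len(model_calls))
```

With a capacity of two, the third distinct phrase evicts "hello", so its last request is a miss again; sizing the cache to the robot's routine vocabulary avoids this.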
Evaluation metrics
The proposed framework was benchmarked using word error rate (WER) for transcription accuracy; end-to-end latency, 31 measured as the delay from speech input to humanoid response; throughput and robustness, for system stability under bandwidth variations; and cross-lingual adaptability, for performance under dynamic language-switching scenarios.32,33
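For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below follows that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions, and the paper's exact text normalization is not stated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the light", "turn off the light"))  # 0.25
```

One substituted word out of four reference words gives a WER of 0.25, so a reported WER of about 10% corresponds to roughly one word error per ten reference words.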
The selected languages were chosen to reflect realistic human–robot interaction scenarios in multilingual environments. English and Hindi represent widely used languages in domestic and institutional settings, while Spanish and Mandarin were included to evaluate cross-lingual scalability. Noise conditions ranging from 0 dB to 30 dB signal-to-noise ratio (SNR) were selected to simulate realistic acoustic environments encountered by humanoid robots, including quiet indoor spaces, moderate background activity, and highly noisy conditions. Packet-loss rates were configured to reflect wireless edge deployments subject to intermittent connectivity, particularly in mobile or crowded environments.34,35 These design choices ensure that the evaluation closely mirrors real-world human–robot interaction (HRI) operating conditions (Fig. 6).

Proposed CLLL-STT over WEN for RTHI.
Proposed algorithm: Cross-Lingual Low-Latency STT for humanoid robots
Noise simulation and calibration procedure
Noise conditions were simulated by injecting real-world background noise samples into clean speech at controlled signal-to-noise ratios (SNRs), in order to evaluate robustness against adverse acoustic conditions. Noise samples were sourced from publicly available environmental sound archives and included office chatter, fan noise, and ambient room noise. The mixing procedure was standardized with RMS-based normalization to achieve uniform SNR values across all test samples. SNR values were varied in constant steps from 0 dB to 30 dB, covering both moderate and elevated noise levels. This calibration ensured repeatability and allowed fair comparison between models under the same acoustic conditions.
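RMS-based mixing at a target SNR can be sketched as follows, using the amplitude relation SNR(dB) = 20·log10(RMS_speech / RMS_noise). The function names `mix_at_snr` and `measured_snr_db` and the toy signals are assumptions; the paper's exact mixing script is not given.

```python
import math
from typing import List, Sequence


def rms(signal: Sequence[float]) -> float:
    """Root-mean-square amplitude of a signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale the noise so the speech-to-noise RMS ratio matches snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise signal is silent")
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech: Sequence[float], mixture: Sequence[float]) -> float:
    """Recover the SNR of a mixture given the clean speech (sanity check)."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 20 * math.log10(rms(speech) / rms(residual))


speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
mixture = mix_at_snr(speech, noise, snr_db=10.0)
print(round(measured_snr_db(speech, mixture), 6))
```

Measuring the SNR of the mixture after scaling is a cheap way to confirm that every test sample lands on the intended noise level.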
Implementation and reproducibility details
Experiments were performed in the PyTorch deep learning framework. All model training and inference were carried out on a hardware platform representative of edge deployment, with a multicore CPU and GPU acceleration. Audio signals were sampled at 16 kHz, and inference was executed on fixed-size audio segments because the system operates in real time. Quantized inference used reduced-precision weights, which lowered memory usage and computational cost. Sample size, learning rate, decoding settings, batch size, and other model parameters were kept unchanged across all experiments to ensure fair comparison. The same implementation and configuration were used for baseline and ablation experiments. These details are included for reproducibility and independent verification of the reported results.
Simulations and Discussion
This section presents the results of the proposed cross-lingual, low-latency STT framework for humanoid robots. The experiments covered four languages (English, Hindi, Spanish, and Mandarin) under different network and noise conditions. We compare the approach with two baselines: cloud STT services and a naive edge-only STT system without language adaptation or caching. Results are reported in terms of accuracy, latency, robustness, and adaptability. The hardware specifications, dataset configurations, and network conditions used for the evaluation are listed as simulation parameters in Table 2. The edge node was configured to represent a resource-constrained deployment so that any observed improvement would be feasible on real humanoid robots. Instead of traditional k-fold cross-validation, repeated scenario-based evaluations were adopted because of the streaming, real-time nature of the proposed system.
Simulation parameters summary
Table 3 compares the average WER, latency, throughput, and robustness across the three systems. The proposed method consistently outperforms the naive edge system and significantly reduces latency compared with the cloud baseline.
Comparison of average word error rate, latency, throughput, and robustness
Proposed edge STT: WER ≈ 10%, latency ≈ 120 ms, throughput ≈ 40 req/s.
Cloud STT: slightly lower WER (9.5%) but ∼3× higher latency (350 ms).
Edge-naive: moderate latency (200 ms) but higher error rate (12%).
These results validate that our framework provides precise answers in a timely manner for human–robot interaction. We conducted an ablation study (Table 4) to validate the contributions of individual components. Removing caching increased latency by ∼30 ms and slightly degraded accuracy. Excluding language identification forced the model to rely on a single multilingual network, raising WER to 11.5%. Disabling quantization improved WER marginally but at the cost of >3× memory consumption, reducing edge deployability. These findings emphasize that each component—caching, LangID, and quantization—contributes to the system’s efficiency.
Ablation study
WER, word error rate.
An ablation study on the contribution of each component was performed by selectively disabling the three major components of the framework: adaptive language identification, caching, and quantization. The full system configuration serves as the reference. When the adaptive lightweight language identification techniques21,30 are removed, latency increases noticeably in multilingual and code-switching situations. This increase illustrates the critical role of early language detection in avoiding model-switching overhead and unnecessary inference delays, particularly in real-time user interaction. Disabling the caching mechanism lengthens the average inference time for frequently repeated commands. Caching is therefore a simple way of avoiding duplicate computation, which is particularly valuable in humanoid applications, where very similar speech tends to recur. Without caching, recognition accuracy does not change significantly, but responsiveness suffers. Removing quantization increases memory usage and computational load, which directly affects latency. Although full-precision inference can yield marginal accuracy gains, its latency penalties make it ill-suited for real-time edge deployment. This observation shows that quantization is needed to balance accuracy and time efficiency. The ablation results show that each subsystem makes a substantial independent contribution and that their combination yields a good trade-off between latency, robustness, and resource consumption. This supports the design decision of a comprehensive framework for real-time multilingual speech-to-text processing on edge devices.
Figure 7 presents latency versus bandwidth for the three systems. Consistent with prior latency-aware STT studies, cloud STT latency rises sharply in low-bandwidth conditions (up to 450 ms at 1 Mbps), whereas the proposed edge STT maintains sub-220 ms latency even under extreme constraints. At higher bandwidths, our system stabilizes around 118–120 ms, which is well suited to conversational humanoids. This indicates that edge-based inference removes the dependency on stable high-throughput networks.

End-to-end latency variation with available bandwidth.
The proposed system consistently maintains lower latency at all bandwidth levels. The small accuracy drop (from 9.5% WER in the cloud to ∼10% WER at the edge) is likely due to edge quantization and limited context but is acceptable for many real-time interactions. The throughput advantage is clear: the edge system can serve more concurrent requests because of reduced transmission overhead. Figure 8 depicts WER as a function of SNR. The proposed and cloud systems perform comparably at medium-to-high SNR (>15 dB). However, at very low SNR (0–5 dB), cloud STT benefits from larger offline training datasets and shows slightly better accuracy. Still, the proposed edge STT remains within 2% to 3% of cloud WER, supporting its robustness in noisy, real-world environments, similar to the multimodal robustness approaches explored in prior work 26 (Table 5).

Noise robustness: WER variation across different SNR levels.
Word error rate (WER) values corresponding to Fig. 8
STT, speech-to-text.
The edge model’s robustness holds well in realistic noise, but there is a trade-off: very noisy conditions still favor the large cloud models trained on richer data, which suggests considering additional noise augmentation or ensembling. Figure 9 shows the latency comparison among the three systems, reinforcing that our edge-based design has ∼66% lower latency than cloud-based transcription. Figure 10 shows throughput degradation under packet loss (Table 6).
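The ∼66% figure follows directly from the average latencies reported in this section (about 350 ms for cloud STT and about 120 ms for the proposed edge STT). The short check below reproduces the arithmetic; the helper names `relative_reduction` and `speedup` are hypothetical.

```python
def relative_reduction(baseline_ms: float, proposed_ms: float) -> float:
    """Fraction by which latency drops relative to the baseline."""
    return (baseline_ms - proposed_ms) / baseline_ms


def speedup(baseline_ms: float, proposed_ms: float) -> float:
    """How many times longer the baseline takes than the proposed system."""
    return baseline_ms / proposed_ms


cloud_ms, edge_ms = 350.0, 120.0
print(round(relative_reduction(cloud_ms, edge_ms) * 100, 1))  # 65.7 (percent)
print(round(speedup(cloud_ms, edge_ms), 2))                   # 2.92
```

The same two numbers therefore support both the "more than 60% lower" and the "∼3× higher latency" statements, which describe one ratio from two directions.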

Average end-to-end latency comparison.

Throughput vs packet loss.
End-to-end latency measurements corresponding to Fig. 9
STT, speech-to-text.
Figure 10 shows throughput under increasing packet loss; the edge deployment exhibits greater resilience to network instability. It also indicates that effective caching and local inference reduce reliance on repeated transmissions, which helps under packet loss. Dynamic language identification was tested under rapid language-switching scenarios (English–Hindi, Hindi–English, English–Mandarin, Spanish–English). As shown in Figure 11, the system adapts within 95–130 ms, allowing natural and uninterrupted interaction across languages. This demonstrates the effectiveness of the lightweight LangID module.

Language switching latency for proposed framework.
Figure 11 reports the latency incurred during cross-lingual language switching. The LangID module works effectively even with short utterances, but switch latency could be improved further for some language pairs. The delay includes a buffer used to satisfy the confidence threshold, which trades off false switches against delay. Figure 12 illustrates the caching hit rate across repeated interactions. After only three to four repetitions, the hit rate exceeds 40%, and after 10 interactions it stabilizes near 85%. This indicates that in practical scenarios (e.g., humanoids working in repetitive environments such as households or hospitals), the system achieves substantial efficiency gains through caching.
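The trade-off between false switches and delay mentioned for the confidence-threshold buffer can be illustrated with a simple hysteresis rule: the active language changes only after a different language has been predicted with sufficient confidence for several consecutive frames. This is an assumed mechanism for illustration; `track_language`, the threshold, and the frame counts are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def track_language(
    predictions: Sequence[Tuple[str, float]],
    threshold: float,
    min_consecutive: int,
    initial: str,
) -> List[str]:
    """Return the active language after each frame-level (language, confidence) prediction."""
    active = initial
    candidate: Optional[str] = None
    streak = 0
    history = []
    for lang, confidence in predictions:
        if lang != active and confidence >= threshold:
            # Extend the streak only if the same new language keeps appearing.
            streak = streak + 1 if lang == candidate else 1
            candidate = lang
            if streak >= min_consecutive:
                active, candidate, streak = lang, None, 0
        else:
            candidate, streak = None, 0
        history.append(active)
    return history


preds = [("en", 0.9), ("hi", 0.95), ("en", 0.9), ("hi", 0.9), ("hi", 0.92), ("hi", 0.6)]
buffered = track_language(preds, threshold=0.8, min_consecutive=2, initial="en")
eager = track_language(preds, threshold=0.8, min_consecutive=1, initial="en")
print(buffered)
print(eager)
```

With a two-frame buffer the isolated "hi" prediction in the second frame is ignored and the switch happens one frame later; with no buffer the tracker switches immediately but also flips back and forth on that isolated frame.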

Caching HR vs repeated interactions.
Figure 12 (caching hit rate versus number of repeated interactions) shows that caching serves repeated utterances well, which is critical for humanoid robots in routine tasks (e.g., greetings, commands). In the ablation, removing LangID increases WER because interference between languages degrades the multilingual model’s performance, and full precision improves accuracy slightly but at an unacceptable resource cost for edge use.
Results and Key Findings
The proposed edge STT provides significantly lower end-to-end latency (≈120 ms) than cloud-based STT (≈350 ms) while maintaining comparable WER (proposed ≈10% vs cloud ≈9.5%). This shows that the system sacrifices little accuracy for a large latency gain, a critical aspect of humanoid responsiveness. Even when bandwidth is low (≤10 Mbps), the proposed system is much more responsive than cloud-based frameworks because inference occurs locally, whereas cloud latency degrades sharply when bandwidth is scarce. The noise robustness results 31 (WER vs SNR) demonstrate that the proposed model performs comparably to cloud STT at medium-to-high SNRs (≥15 dB). At very low SNRs (≤5 dB), cloud models trained on larger corpora still have a slight edge. Throughput under packet loss shows that the proposed edge STT is more resilient than cloud STT: its throughput decreases gradually with packet loss (from ∼42 req/s to 32 req/s at 10% loss).
Caching gives an operational benefit: as interactions repeat, the hit rate rises quickly, which reduces latency and computation at inference time. The contributions of all three components (caching, LangID, and quantization) are confirmed by ablation: removing caching increases latency and WER slightly, and removing quantization improves accuracy marginally but increases the memory footprint by more than 3×. The ablation highlights how the subsystems are interlinked and how each contributes to overall system performance. Eliminating the adaptive lightweight language identification techniques21,30 raises latency on multilingual transitions, which confirms the importance of this module for real-time language switching. Disabling the caching mechanism increases inference latency and computational burden for repeated commands without a comparable effect on accuracy. These results demonstrate that each component contributes in a distinct way to latency reduction, robustness, and resource use (Table 7).
Comparison with baseline
STT, speech-to-text; WER, word error rate.
Computational complexity and resource analysis
Computational complexity was assessed in terms of inference latency, memory footprint, and processing throughput rather than abstract asymptotic notation. Quantization reduced the model size and memory access costs, and caching prevented repeated inference for familiar utterances. Together, the two mechanisms reduced average latency and latency variance under sustained workloads, supporting the suitability of the framework for real-time edge deployment.
Overall trade-offs and limitations: latency vs accuracy
Small decrease in WER (<1%–2%) will be offset by huge improvements in latency. Quantizing or pruning reduces memory and compute, but some edge devices having very low memory, or no hardware acceleration at all, would still struggle (Table 8).
Representative error cases
Qualitative analysis shows that most transcription errors occur during rapid code-switching or under extremely low SNR conditions, as discussed in robust speech recognition literature. These cases typically involve partial word deletions or substitutions, indicating limitations in language boundary detection rather than acoustic modeling.
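Because these error cases are described in terms of substitutions and deletions, it may help to see how such counts arise from the standard WER definition, WER = (S + D + I) / N, where N is the number of reference words. The sketch below is a plain word-level edit-distance alignment, not the scoring tool used in the experiments (which is not specified here), and the example sentence is invented for illustration.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Align two word sequences and count substitutions, deletions, insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    if n == 0:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if ref[i - 1] == hyp[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion (reference word dropped)
                cost[i][j - 1] + 1,             # insertion (extra hypothesis word)
            )

    # Walk back through the table to attribute each edit to a category.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 0 if ref[i - 1] == hyp[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                subs += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {"S": subs, "D": dels, "I": ins, "N": n, "WER": (subs + dels + ins) / n}


result = wer_breakdown(
    "please turn on the kitchen light",
    "please turn the kitchen lights",
)
print(result)  # one deletion ("on") and one substitution ("light" -> "lights")
```

For this invented pair, the alignment yields one deletion and one substitution over six reference words, a WER of about 33%, while the rest of the sentence is intact. This matches the pattern of partial word-level errors rather than sentence-level failures.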
Qualitative error analysis
Qualitative analysis indicates that most transcription failures occur during rapid language switching or under extremely low SNR conditions. In such cases, errors typically manifest as word substitutions or partial deletions rather than complete sentence-level failures. Accent-related errors were also observed for speakers not well represented in the training data. These observations suggest that future improvements should focus on faster language boundary detection and improved robustness to accent variability (Table 9).
Representative transcription error cases observed
Conclusion and Future Work
This paper proposed an edge-enabled, cross-lingual speech-to-text framework for humanoid robots operating in multilingual, real-world settings. By moving computation from the cloud server to wireless edge nodes, the system shows that humanoid assistants can deliver the fast responses needed for natural human interaction and can meet user-perceived latency constraints in HRI systems.35 Experiments showed a 60%–70% reduction in end-to-end latency compared with the standard cloud-based pipeline while keeping transcription accuracy within 1% to 2% of cloud baselines. Notably, the framework was resilient to bandwidth variability, packet loss, network degradation, and background noise, and it remained stable under dynamic conditions. Lightweight language identification facilitated smooth code-switching, and efficiency techniques, including quantization, pruning, and caching, made deployment possible even on constrained hardware. Collectively, these results indicate that low-latency, multilingual, and resource-efficient STT is feasible for humanoid robots (HRAs), which represents a contribution to the human–robot interaction research community.
Despite these improvements, there remain opportunities for further work. One is extending the framework to low-resource and underrepresented language varieties, including those with high dialectal diversity and frequent code-switching, which remain persistent problems in current work. An even more significant opportunity is improving robustness in harsh acoustic environments. Generalization to crowded or reverberant conditions could be improved by room impulse response augmentation, by training with diverse real-world ambient noise sources, and by more sophisticated noise modeling.
Beyond accuracy, the framework could be enhanced by adaptive model scaling, which would let the system adjust model complexity dynamically according to available battery, compute, and network conditions. Hybrid edge–cloud collaboration is another direction: edge-based inference offers good response time and privacy protection, but selectively offloading noisy or long utterances to the cloud may also be beneficial. Finally, user-centric evaluations remain essential. Testing the method in practical settings such as homes, schools, and hospitals can provide a deeper understanding of how response delay, recognition accuracy, and conversational flow affect user satisfaction and trust. Overall, the results demonstrate the transformational role that multilingual speech-to-text at the edge can play in turning humanoid robots from merely reactive mechanisms into active participants in natural interaction. The approach can be implemented securely and flexibly, and it offers a basis for an open platform for deploying interactive robots across diverse linguistic, acoustic, and network contexts. One of the key contributions of this work is to demonstrate that low-latency, multilingual STT can be achieved without relying on large cloud-based models; instead, it can be realized on the edge devices themselves. By addressing system-level constraints and linguistic diversity together, the work contributes to the development of more flexible, resilient, and sustainable humanoid robots that can operate in real-world physical environments. The experimental approach also highlights practical trade-offs between latency, accuracy, and resource use.
Methods such as aggressive quantization and caching lessen the computational burden but require additional fine-tuning effort to retain recognition quality across many languages and in noisier environments. This is particularly important because speech-enabled robotic systems must run on constrained hardware. Timely communication is essential to natural human–robot interaction. In a speech-to-text setting, delays beyond a tolerable threshold interrupt turn-taking, reduce user engagement, and erode users’ confidence in the system, so this timing limit must be maintained. The framework strikes a sensible compromise: parallel execution paths avoid delays, caching prevents repeated inference, and quantization reduces computation. Accordingly, it is well suited to sustained operation on mobile humanoid robots in dynamic, resource-constrained settings, in line with emerging trends in edge-based speech processing.
Future Research Directions
Next steps will focus on scaling the framework to better accommodate low-resource and underrepresented languages with limited training data and linguistic tools. Concurrently, the solution will be extended with dynamic scaling features that adapt processing and resource utilization to changing runtime conditions. Hybrid edge–cloud collaboration will enable selective offloading of complex or computation-heavy requests while keeping interactions low-latency. In addition to technical enhancements, we plan to conduct live user studies of long-term deployments on humanoid robots. The goal is to measure interaction quality, user acceptance, and system robustness over extended use, providing feedback that will guide further work to make the framework perform reliably in everyday use.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU261134]. This work was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R432), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
