Abstract: The increasing volume of multilingual news broadcasts highlights the need for advanced systems capable of transforming speech into semantically comparable text across languages. Traditional speech-to-text and textual similarity methods often fall short in handling linguistic diversity, contextual ambiguity, and cross-lingual semantic alignment. To overcome these limitations, we introduce a Transformer–Graph Neural Network (GNN) integrated framework for multilingual news speech-to-text similarity modeling. This article presents an approach that leverages a Transformer encoder to extract deep contextual embeddings from speech inputs, capturing sequential and contextual nuances. These embeddings are then structured into graphs that represent semantic relations among words, phrases, and sentences. A GNN refines these graph-based representations by modeling relational dependencies across languages. Finally, a cross-lingual semantic alignment module produces similarity scores, enabling accurate transformation of multilingual speech into comparable text. Experiments conducted on benchmark multilingual news video datasets in English, Hindi, Marathi, and Tamil show that our framework consistently outperforms baseline models, including standalone Transformers and GNNs. The model achieved significant gains, with improvements of 7.8% in semantic similarity accuracy, 6.1% in BLEU score, and 8.4% in cross-lingual alignment efficiency. Furthermore, it demonstrated robustness to noisy input, code-switching, and low-resource language scenarios, making it suitable for practical multilingual news applications. The proposed approach achieved a relative improvement of 4.8% in semantic similarity and a 3.1% reduction in word error rate compared with the baseline models. Future directions include extending the framework for real-time deployment, expanding support to underrepresented languages, and incorporating multimodal news data for enriched global media analysis.
Introduction
The exponential growth of digital media platforms has transformed the way news is disseminated and consumed worldwide. With the rise of multilingual content, especially in Asia where linguistic diversity is vast, ensuring accurate cross-lingual understanding of news has become a pressing challenge. News agencies frequently broadcast the same information in multiple languages, including English, Hindi, Marathi, and Tamil, to reach broader audiences. However, the semantic interpretation and retrieval of this multilingual speech content remain difficult due to variations in grammar, vocabulary, pronunciation, and contextual usage across languages [1].
Traditional speech-to-text systems have achieved notable progress in transcription accuracy, but their effectiveness declines when applied to highly diverse linguistic environments [2]. Furthermore, existing text similarity approaches often focus on monolingual or bilingual scenarios, limiting their applicability in real-world, multilingual contexts. As a result, there is a need for advanced Natural Language Processing (NLP) frameworks capable of capturing both semantic and structural relationships in multilingual speech-to-text transformations [3].
From a technological perspective, multilingual speech-to-text conversion facilitates cross-lingual information retrieval, enabling users to search, compare, and analyze news content across languages [4]. It supports semantic similarity detection, ensuring that identical or related news stories in different languages can be aligned and aggregated. This capability is particularly useful for fact-checking, detecting misinformation, and tracking the progression of narratives across linguistic boundaries [5].
Moreover, multilingual speech-to-text systems contribute significantly to media accessibility [6,7]. They empower individuals with hearing impairments by providing textual equivalents of spoken news and assist language learners by offering accurate transcriptions across multiple languages [8]. Governments, policymakers, and researchers also benefit from such systems, as they can analyze multilingual media streams in real time to monitor public sentiment, policy impact, or crisis communication [9].
Recent advances in deep learning, particularly Transformer architectures, have revolutionized NLP by enabling the modeling of long-range dependencies and contextual embeddings [10–13]. Transformers such as BERT, mBERT, and XLM-RoBERTa have demonstrated success in multilingual tasks, but their performance is often constrained when semantic relations among sentences and phrases need to be explicitly represented. To bridge this gap, Graph Neural Networks (GNNs) have emerged as a complementary paradigm, excelling at modeling relational structures within data. The integration of Transformers with GNNs thus offers a powerful mechanism to combine contextual sequence learning with relational graph-based reasoning [14,15].
In this work, we explore the application of a Transformer-GNN integrated framework for generating semantically comparable text from multilingual Asian news speech. By leveraging English, Hindi, Marathi, and Tamil news datasets, the proposed approach first transcribes speech into language-specific embeddings and then constructs semantic graphs to represent inter- and intralingual relationships. A GNN layer further refines these embeddings by capturing cross-lingual dependencies, ultimately generating similarity-based text outputs. This framework not only enhances semantic alignment across languages but also provides robustness against noise, code-switching, and low-resource language challenges frequently observed in Asian media.
The main contributions of this article are as follows:
(1) A novel Transformer-GNN hybrid model for multilingual speech-to-text similarity generation across four linguistically diverse Asian languages.
(2) A preprocessing pipeline that converts raw news videos into denoised, high-quality speech samples.
(3) A Transformer model that extracts deep contextual embeddings from multilingual speech.
(4) A GNN module that captures semantic and relational dependencies across languages.
(5) Semantically aligned text outputs generated through speech-to-text similarity modeling.
The remainder of the article is organized as follows: Section 2 reviews related work, Section 3 describes the proposed methodology, Section 4 presents the experimental setup and results, and Section 5 concludes with future research directions.
Material and Methods
Study design
The proposed study is designed to address the challenges of transforming multilingual news speech into semantically comparable text by integrating the strengths of Transformer-based contextual learning and GNN-driven relational modeling. The workflow follows a structured sequence beginning with raw video acquisition, followed by preprocessing, contextual modeling, relational refinement, and similarity-driven text generation.
Data acquisition and preprocessing
News video datasets were collected from multilingual sources covering English, Hindi, Marathi, and Tamil from both broadcast and online platforms to ensure linguistic diversity. The audio streams were extracted using FFmpeg with a sampling rate of 16 kHz, 16-bit PCM, and mono channel configuration. To prepare high-quality inputs, a comprehensive speech enhancement pipeline was employed. Initially, denoising was performed using spectral subtraction and Wiener filtering to suppress background noise. Silence segments were removed through Voice Activity Detection (VAD), which combined energy-based and neural thresholds for accuracy. The processed signals were then normalized using RMS scaling and peak normalization to achieve consistent loudness levels across samples. To further improve model robustness, data augmentation techniques were applied, including speed perturbation at factors of 0.9, 1.0, and 1.1, the addition of background noise at signal-to-noise ratios (SNRs) ranging from +5 to +15 dB, and SpecAugment-based time and frequency masking. The resulting clean audio samples were converted into log-Mel spectrograms, providing a robust input representation for the Transformer model.
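The noise-augmentation step can be illustrated with a small sketch. It scales a noise signal so that the mixture reaches a requested SNR, as in the +5 to +15 dB range stated above. The function name `mix_at_snr` and the synthetic signals are assumptions made for illustration, not the authors' implementation.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio equals snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise signal is silent")
    # Choose gain g so that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain

# Synthetic stand-ins: a 440 Hz tone at 16 kHz and a deterministic pseudo-noise.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [((t * 7919) % 101 - 50) / 50 for t in range(1600)]
mixed, gain = mix_at_snr(speech, noise, snr_db=10)
```

In practice the SNR would be drawn per utterance from the stated range, and the noise would come from recorded background audio rather than a synthetic sequence.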
Proposed framework
The overall architecture of the proposed system is illustrated in Figure 1. The workflow begins with news video input, which undergoes speech enhancement and preprocessing to produce clean audio samples. These samples are converted into spectrograms and processed by the Transformer encoder, generating contextual embeddings. The embeddings are then structured into a semantic graph, which is refined using a GNN. The final stage applies a semantic similarity module to produce text outputs that are semantically aligned across languages.

Figure 1. Block diagram architecture of the proposed approach.
The block diagram highlights the sequential yet integrated nature of the pipeline, showcasing the synergy between contextual modeling through Transformers and relational refinement via GNNs. By leveraging both sequential and graph-based learning, the proposed framework achieves improved semantic alignment and robustness in multilingual speech-to-text similarity tasks.
Speech Enhancement and Cleaning
The speech enhancement and cleaning stage ensured that the extracted audio samples were of high fidelity before downstream processing. Denoising was achieved through spectral gating combined with Wiener filtering, and in cases of persistent noise, refinement was optionally performed using a lightweight deep neural network enhancer (RNNoise-style) model. To suppress nonspeech interference, hum and background music were reduced using notch filters at 50/60 Hz alongside harmonic suppression techniques. Normalization was applied by setting the peak amplitude at −1 dBFS and performing per-utterance RMS scaling to maintain consistency across recordings. As part of quality control, any segment with an estimated SNR below 10 dB, determined using an energy-based estimate, was discarded to avoid contaminating the dataset with poor-quality audio.
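A minimal sketch of the normalization and quality-control rules described above follows. It assumes floating-point samples with full scale at 1.0. The helper names are hypothetical, and the SNR value is taken as given because the article does not specify the energy-based estimator.

```python
def peak_normalize(samples, target_dbfs=-1.0):
    """Scale samples so the largest absolute value sits at target_dbfs."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)  # silent input: nothing to scale
    target = 10 ** (target_dbfs / 20)  # -1 dBFS is roughly 0.891 of full scale
    return [x * target / peak for x in samples]

def passes_quality_control(estimated_snr_db, threshold_db=10.0):
    """Keep a segment only if its estimated SNR reaches the threshold."""
    return estimated_snr_db >= threshold_db

normalized = peak_normalize([0.1, -0.5, 0.25])
kept = [snr for snr in (4.2, 10.0, 17.5) if passes_quality_control(snr)]
```

Per-utterance RMS scaling would be a second gain stage; how the two gains are ordered or reconciled is not stated in the text, so it is left out of this sketch.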
Following enhancement, speech segmentation was conducted to prepare the audio for feature extraction. VAD was applied using a hybrid energy-based and neural model approach, enabling accurate discrimination between speech and nonspeech regions. Short gaps of less than 300 ms between adjacent speech segments were merged to preserve continuity, while excessively long utterances were capped at a maximum length of 20 seconds for computational efficiency. Optionally, speaker diarization was employed using x-vector embeddings combined with spectral clustering to address crosstalk and multispeaker scenarios. Segments with high overlap between speakers were discarded to ensure the dataset primarily contained clean, single-speaker speech samples.
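The gap-merging and length-capping rules can be sketched as below. The input is assumed to be a list of `(start, end)` times in seconds, sorted by start time, as a VAD might produce. Splitting over-long utterances into consecutive chunks is one possible reading of "capped"; truncation would be another.

```python
def merge_and_cap(segments, max_gap=0.3, max_len=20.0):
    """Merge speech segments separated by short gaps, then limit segment length."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < max_gap:
            # Gap shorter than 300 ms: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    capped = []
    for start, end in merged:
        # Assumption: over-long utterances are split into chunks of max_len.
        while end - start > max_len:
            capped.append((start, start + max_len))
            start += max_len
        capped.append((start, end))
    return capped

short_gaps = merge_and_cap([(0.0, 1.0), (1.2, 2.0), (3.0, 4.0)])
long_utterance = merge_and_cap([(0.0, 45.0)])
```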
Feature Extraction
For effective representation of speech signals, feature extraction was performed using log-Mel spectrograms, which are widely regarded for their robustness in speech and multilingual processing tasks. Each audio segment was converted into an 80-dimensional log-Mel spectrogram computed with a 25 ms analysis window and a 10 ms hop length, providing a fine temporal resolution while capturing the spectral characteristics of speech. To further capture dynamic variations in the signal, first-order (Δ) and second-order (ΔΔ) derivatives of the Mel features were optionally computed, enriching the representation with temporal trajectory information. This combination of static and dynamic features ensured that both spectral shape and temporal dynamics of speech were preserved, which is particularly valuable in modeling prosody and intonation across languages.
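The framing arithmetic implied by these settings, together with the optional dynamic features, can be sketched as follows. The Mel filterbank itself is omitted. The two-point delta formula (a regression window of one frame on each side, with edge frames repeated) and the no-padding frame count are assumptions, since the article does not state them.

```python
SAMPLE_RATE = 16_000             # Hz, as stated for the extracted audio
WIN = SAMPLE_RATE * 25 // 1000   # 25 ms analysis window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000   # 10 ms hop length      -> 160 samples
N_MELS = 80

def num_frames(n_samples):
    """Number of full analysis windows that fit, without padding."""
    if n_samples < WIN:
        return 0
    return 1 + (n_samples - WIN) // HOP

def delta(frames):
    """First-order difference per feature dimension, edges repeated."""
    last = len(frames) - 1
    return [
        [(nxt - prev) / 2
         for prev, nxt in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
        for t in range(len(frames))
    ]

def stack_dynamic(frames):
    """Concatenate static, delta, and delta-delta features per frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

Under these assumptions a one-second segment yields 98 frames, and stacking both derivative orders triples the per-frame dimension from 80 to 240.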
To minimize the impact of channel variability and recording conditions, cepstral mean–variance normalization was applied on a per-segment basis. This normalization step reduced interspeaker and interchannel variability by normalizing the feature distributions, thereby improving the robustness of the downstream models. The resulting features provided a compact yet informative representation of speech, well-suited for processing by the Transformer encoder and subsequent GNN modules in the proposed framework.
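Per-segment mean–variance normalization can be written in a few lines. The sketch below assumes each segment is a list of frames, each frame a list of feature values. The small `eps` term is an added safeguard against division by zero and is not taken from the article.

```python
import math

def cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance per segment."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
        for f in frames
    ]

normalized = cmvn([[1.0, 10.0], [3.0, 30.0]])
```

After normalization, both feature dimensions in the example take values close to -1 and +1, regardless of their original scale.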
Transformer Encoder for Speech
The Transformer encoder forms the core of the contextual modeling stage, converting spectrogram sequences into high-dimensional embeddings that capture both local and global speech dependencies. The input to the encoder consists of the preprocessed log-Mel spectrogram sequences, with one 80-dimensional feature vector per 10 ms frame.

Figure 2. Block diagram architecture of the Transformer encoder.
To reduce computational complexity and capture local patterns, a convolutional front end is applied that divides the input into overlapping patches along the time axis. Each patch is then projected into the model dimension.
The Transformer encoder consists of 12 identical layers, each with 8 attention heads. For each head, the query, key, and value matrices are obtained from the layer input by learned linear projections, and scaled dot-product attention is computed over them.
Next, the outputs of all heads are concatenated and passed through an output projection.
Finally, layer normalization is applied over the residual connection to produce the feature map (Equation 5).
Each encoder layer also contains a position-wise feed-forward network, which is applied independently at every time step.
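The per-head attention step can be illustrated with the standard scaled dot-product formulation. This is a dependency-free sketch of the textbook operation, not the authors' code. The learned projections, the 8-head split, and the 12-layer stack are omitted, and the toy matrices are arbitrary.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention([[0.0, 0.0]], keys, values)
```

A zero query scores every key equally, so the output in this example is simply the average of the value rows.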
GNN for Relational Refinement
After extracting contextual embeddings from the Transformer encoder, the next step is to capture semantic and relational dependencies across words, phrases, and sentences. This is achieved by constructing a semantic graph whose nodes carry the contextual embeddings.
Edges of the graph are defined by multiple criteria, including:
(1) Temporal adjacency, which connects consecutive tokens.
(2) Semantic similarity, which connects nodes whose cosine similarity exceeds a threshold.
The final edge weight combines these factors into a single weight per edge.
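A possible construction of such a graph is sketched below. The threshold `tau`, the mixing coefficients `w_temporal` and `w_semantic`, and the weighted-sum combination are all hypothetical choices made for illustration. The article's own combination rule is not reproduced here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_graph(nodes, tau=0.8, w_temporal=0.5, w_semantic=0.5):
    """Return {(i, j): weight} for i < j (hypothetical weighting scheme)."""
    edges = {}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            temporal = 1.0 if j == i + 1 else 0.0    # consecutive tokens
            sim = cosine(nodes[i], nodes[j])
            semantic = sim if sim > tau else 0.0      # thresholded similarity
            weight = w_temporal * temporal + w_semantic * semantic
            if weight > 0:
                edges[(i, j)] = weight
    return edges

edges = build_graph([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
```

In the example, nodes 0 and 2 are not adjacent in time but are linked because their embeddings are nearly parallel, while the two adjacent pairs are linked by the temporal criterion alone.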
For message passing and node updates, a Graph Attention Network (GATv2) layer is applied to propagate information across the graph. For node i, the updated embedding is an attention-weighted aggregation of the embeddings of its neighbors.
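A single-head GATv2-style update can be sketched without external libraries. The score for a neighbor j of node i follows the published GATv2 form, a^T LeakyReLU(W_l h_i + W_r h_j), normalized by a softmax over the neighborhood. The matrix names, the omission of an output nonlinearity, and the fact that edge weights are ignored are simplifying assumptions, not details taken from the article.

```python
import math

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gatv2_update(h, neighbors, w_l, w_r, a):
    """One attention-weighted neighborhood aggregation per node.

    h: list of node vectors; neighbors: {i: [j, ...]}, each list non-empty
    (include i itself for a self-loop).
    """
    left = [matvec(w_l, x) for x in h]
    right = [matvec(w_r, x) for x in h]
    updated = []
    for i in range(len(h)):
        nbrs = neighbors[i]
        scores = [
            sum(a_k * leaky_relu(l + r) for a_k, l, r in zip(a, left[i], right[j]))
            for j in nbrs
        ]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]  # attention over the neighborhood
        updated.append([
            sum(al * right[j][d] for al, j in zip(alpha, nbrs))
            for d in range(len(a))
        ])
    return updated

identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0]]
out = gatv2_update(h, {0: [0, 1], 1: [0, 1]}, identity, identity, [1.0, 1.0])
```

A production system would more likely rely on an existing graph library layer with multiple heads; the loop form is used here only to make the arithmetic visible.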
After L GNN layers, node embeddings are aggregated to form sentence- or segment-level representations.
Semantic Similarity and Text Generation
The final stage of the framework focuses on aligning multilingual speech representations with semantically equivalent textual outputs. After relational refinement through the GNN, the resulting embeddings serve as the speech-side representations for cross-lingual alignment.
To measure semantic closeness between speech-derived embeddings and candidate text embeddings, cosine similarity is employed as given by Equation 10.
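For reference, the standard definition of cosine similarity is given below. The symbols $\mathbf{z}_s$ (speech-derived embedding) and $\mathbf{z}_t$ (candidate text embedding) are notation assumed here for illustration and may differ from the notation used in Equation 10.

```latex
\[
\mathrm{sim}(\mathbf{z}_s, \mathbf{z}_t)
  = \frac{\mathbf{z}_s \cdot \mathbf{z}_t}
         {\lVert \mathbf{z}_s \rVert_2 \, \lVert \mathbf{z}_t \rVert_2}
  \in [-1, 1]
\]
```

A value near 1 indicates that the two embeddings point in nearly the same direction, which is read as high semantic closeness regardless of their magnitudes.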
Once semantic similarity is established, the embeddings are passed into a Transformer-based decoder for text reconstruction. Given the refined speech embedding, the decoder generates the corresponding output text.
The overall flow of the methodology is finally summarized into an algorithm as presented below (Algorithm 1).
Results
The dataset was partitioned into training, validation, and test sets using a 70%–15%–15% split. This allocation was chosen to balance model learning capacity against unbiased evaluation. The 70% training portion provides sufficient data diversity across the four languages and varied noise conditions, which is particularly important for the Transformer and GNN modules to learn robust semantic representations. The 15% validation set enables reliable hyperparameter tuning and early stopping without risking overfitting to the training data. The 15% test set supports statistically meaningful performance assessment across multilingual samples, especially given the variability in accent, prosody, and speech quality. We also experimented with alternative proportions (for example, 80–10–10 and 75–15–10) and found that the 70–15–15 split offered the most stable validation performance while preserving ample test coverage. The hyperparameters were tuned on the validation set, and the values used in the experiments are presented in Table 1.
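A reproducible 70–15–15 partition can be sketched as follows. The fixed seed and the plain random shuffle are assumptions. The article does not say whether the split was stratified by language or by broadcast source.

```python
import random

def split_dataset(items, train_pct=70, val_pct=15, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, about 15% of the data
    return train, val, test

train, val, test = split_dataset(range(200))
```

For news audio it is usually safer to split at the level of whole broadcasts rather than individual segments, so that segments from one recording do not appear on both sides of the split.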
Table 1. Model hyperparameters used in model training
GNN, Graph Neural Network.
Quantitative evaluation
The proposed Transformer–GNN framework was evaluated on multilingual news datasets comprising English, Hindi, Marathi, and Tamil broadcast sources. Performance was measured using metrics relevant to both speech-to-text alignment and text generation quality. For semantic similarity, we report cosine similarity scores, STS Pearson/Spearman correlations, and Recall@K for retrieval tasks. For text generation, we used BLEU, ROUGE-L, and BERTScore to assess fluency and semantic adequacy.
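Two of the similarity metrics named above are simple enough to state in code. The sketch below gives textbook definitions of Recall@K and the Pearson correlation. The toy rankings and scores are invented for illustration and are not results from the study.

```python
import math

def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold item appears among the top-k candidates."""
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)

def pearson(x, y):
    """Pearson correlation between predicted and reference similarity scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two queries, each with a ranked candidate list and one gold candidate id.
r_at_2 = recall_at_k([[3, 1, 2], [0, 2, 1]], [1, 1], k=2)
corr = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Spearman correlation is the same computation applied to the ranks of the scores instead of the raw values.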
Our model achieved an average BLEU score of 36.4, ROUGE-L of 42.7, and BERTScore of 0.87, outperforming baseline Transformer-only and GNN-only models by significant margins. In similarity estimation, the proposed method reached STS correlation of 0.83, exceeding multilingual ASR baselines (0.72) and standard seq2seq models (0.68). Retrieval accuracy (Recall@5) was 91.2%, showing strong alignment across languages. The results are presented in Table 2. The experimentation results of the semantic similarity evaluation using various measures are presented in Table 3.
Table 2. Quantitative evaluation of text generation
↑ means higher is better.
Table 3. Semantic similarity evaluation
↑ means higher is better.
To assess the contributions of individual components, we conducted an ablation study. Removing the GNN layer reduced semantic similarity correlation by ∼9%, highlighting its role in capturing relational dependencies. A variant trained without the similarity loss term was also evaluated. The ablation results are presented in Table 4.
Table 4. Ablation study of proposed model
Qualitative analysis
Case studies further illustrate the effectiveness of our approach. In Hindi and Marathi news segments, where background noise and code-switching were common, the enhanced preprocessing pipeline yielded cleaner embeddings that preserved linguistic cues. The GNN successfully resolved semantic ambiguity by linking contextually related words across languages (e.g., a source-language term ↔ “finance minister”), enabling more accurate cross-lingual similarity detection. Generated texts showed improved coherence, with fewer mistranslations and higher fidelity to original speech intent compared with baseline systems. The results are presented in Table 5.
Table 5. Qualitative examples of semantic alignment in news speech-to-text
Comparative evaluation
The results comparing the proposed method against leading baselines, including mBART, Whisper, and other decoders, are presented in Figure 3. Our framework outperformed these baselines on all evaluation metrics, particularly in low-resource scenarios where multilingual pretraining alone struggled. The integration of GNN-based relational refinement proved especially beneficial for languages with complex morphology, leading to superior alignment and generation quality. The semantic similarity results under different measures are presented in Figure 4.

Figure 3. Visual results demonstrating text generation performance.

Figure 4. Semantic similarity performance on different models.
Discussions
The experimental findings demonstrate that the integration of Transformer encoders with GNNs provides a significant advantage in multilingual news speech-to-text similarity modeling. Unlike conventional ASR pipelines, which primarily rely on direct acoustic-to-text mappings, the proposed approach leverages semantic alignment across languages. This ensures that paraphrased or contextually equivalent sentences are treated as consistent outputs, an essential capability for real-world news media where delivery styles, accents, and linguistic variations often differ.
One of the notable observations is the framework’s robustness to noise and variability. The preprocessing and speech enhancement stages, coupled with Transformer embeddings, significantly improved transcription quality even in challenging conditions such as accented speech (Hindi), low-resource morphology (Marathi), and prosodic variation (Tamil). These findings emphasize that the model does not merely replicate literal transcription but adapts to semantic fidelity, which is more relevant for information retrieval and summarization tasks in journalism and broadcasting. The ablation studies further revealed that both the Transformer and GNN components play complementary roles. The Transformer provides contextualized embeddings that capture long-range dependencies within speech, while the GNN enforces relational constraints between embeddings, thereby reducing inconsistencies across segments and languages. Comparative evaluations also confirmed that the proposed system outperforms traditional ASR baselines and end-to-end speech-to-text models, especially in low-resource and noisy scenarios.
Existing large-scale pretrained models achieve strong performance in high-resource languages, but they rely heavily on large pretraining corpora and struggle with domain-specific noise and low-resource language variability, areas in which the proposed approach is comparatively robust. Unlike these large-scale models, however, our system is constrained by smaller training data and lacks end-to-end ASR decoding capability. Further limitations remain. For instance, rare proper nouns, idiomatic expressions, and heavy code-switching still pose challenges, particularly in low-resource languages such as Marathi and Tamil. In addition, while semantic similarity modeling reduces literal transcription errors, it may at times overgeneralize, leading to loss of fine-grained lexical detail. These issues point to the need for future enhancements, such as incorporating multilingual pretrained LLMs and context-aware lexicons for named entity handling.
Conclusion
This work presented a novel framework that integrates Transformer encoders and GNNs for multilingual news speech-to-text similarity modeling. Unlike conventional ASR systems that focus primarily on literal transcriptions, the proposed approach emphasizes semantic fidelity, ensuring that contextually equivalent meanings are preserved across diverse languages and speaking conditions. The study demonstrated that news videos from English, Hindi, Marathi, and Tamil could be effectively processed through a multistage pipeline comprising speech enhancement, Transformer-based contextual embedding, and GNN-driven semantic alignment. Quantitative evaluations showed that the model improved BLEU, ROUGE-L, BERTScore, and semantic similarity scores compared with baseline ASR and end-to-end speech-to-text models. Ablation studies highlighted the complementary contributions of both Transformer and GNN modules, while qualitative results underscored the system’s robustness to noise, accent variation, and low-resource language challenges.
By capturing semantic meaning rather than surface forms, the proposed framework proved particularly useful for real-world applications such as multilingual news monitoring, journalism archiving, and cross-lingual media analytics. While limitations remain in handling rare proper nouns, idiomatic phrases, and code-switching phenomena, these can be addressed in future work by integrating large multilingual pretrained language models and domain-specific lexicons.
Authors’ Contributions
All the authors contributed equally to this work. However, the individual roles of each author are given as follows: J.J.: Data curation and visualization; S.K.: Formal analysis and writing—original draft; U.K.J.: Methodology; N.K.A.: Writing—modified draft; J.A.: Supervision and validation; A.V.: Writing—original draft.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.
