Multimodal subtitle translation algorithm integrated with syntax error correction mechanism

Abstract

In this study, we present a multimodal subtitle translation system that integrates a syntactic error correction mechanism based on a Generalized Regression Neural Network (GRNN). To produce accurate and fluent subtitles, the system combines text, video, and audio inputs. The GRNN-based correction mechanism detects and fixes syntactic errors in the translated subtitles, raising their overall quality. Experimental results show notable gains in translation accuracy and fluency. The proposed method reaches a translation accuracy of 92.5%, beating the baseline by 10.2%. With a syntax error correction accuracy of 95.1%, the GRNN-based mechanism lowers the syntax error rate by 70.5%. The method achieves a fluency score of 4.8/5.0, compared with 4.2/5.0 for the baseline, and a BLEU score of 0.85, indicating a high degree of similarity between the reference and translated subtitles. A variant optimized with differential evolution (DE-GRNN) beats the GRNN-based method on every measure: an 8.2% increase in BLEU score, a 2.9% increase in translation accuracy, a 3.6% increase in syntax error correction accuracy, and a 2.1% increase in fluency score, reflecting more natural and readable output. These findings show that the proposed method generates correct, fluent, and syntactically error-free subtitles.

Keywords

syntax error correction, generalized regression neural network (GRNN), machine translation, error correction, language translation, and subtitle translation

Introduction

Natural language processing research relies on fundamental syntax analysis across a wide range of applications.¹ Some automatic phrase analysis algorithms² cannot distinguish certain very basic structures that occur in manual translation.^2,3 For English–Chinese machine translation, a fair number of well-developed automated phrase analysis approaches are already available.⁴ Because the sentence function determines the position range in the translation result,⁵ a novel technique⁶ has been established to make use of the sub-sentences contained within a sentence.

Zhou Yating studied English⁷ text machine translation that takes the sentence delimited by a full stop as its unit of analysis. Within this translation unit system, the part, translate, and assemble (PTA) model carries out the English–Chinese translation process⁸ and produces a text-oriented English–Chinese clause corpus. That work gives a full description of the PTA paradigm, with special emphasis on the significance of the corpus.⁹ A related application used an English machine translation model founded on a semantic network.¹⁰ To improve translation accuracy, a vector-based hybrid phrase synthesis semantic statistical English machine translation technique¹¹ was used as an extra component of the translation similarity model. During measurement, cosine similarity is used to quantify the degree of semantic similarity between the two vectors.¹²

Several studies have explored multimodal subtitle translation systems, each with unique contributions and approaches. Table 1 presents the comparison among existing approaches and proposed work. Baniata et al.¹³ proposed an attention-based NMT model with visual cues and rule-based GEC post-processing, achieving high translation fluency but struggling with complex syntactic errors. Boes and Van hamme¹⁴ developed a multi-encoder architecture combining text, audio, and visual features, capturing subtle nuances from multimodal inputs but facing computational intensity and multimodal data synchronization challenges. Ananthanarayana et al.¹⁵ introduced a reinforcement learning framework for temporal alignment, achieving excellent temporal coherence but facing generalization issues with novel syntactic structures. Anggrainingsih et al.¹⁶ explored a transformer-based model with a dedicated syntax tree embedding layer, achieving highly accurate syntactic error correction but limited direct integration of non-textual modalities. Sulubacak et al.¹⁷ designed a cascaded system with a multimodal context integration module and a syntax correction module, allowing for independent improvement of components but facing propagation of errors between modules. In contrast, the proposed work introduces a novel multimodal subtitle translation system integrating GRNN-based syntactic error correction, addressing video and audio modalities for comprehensive content understanding and achieving high subtitle accuracy with 95.1% syntactic error correction accuracy. While the proposed system has limitations such as complexity and language specificity, it offers potential for integration with other NLP tasks, expansion to other languages, and further improvement of syntax error correction.

Table 1.

Comparison among existing approaches and proposed work.

Reference	Core contribution	Multimodality addressed	Syntax error correction mechanism	Strengths	Limitations	Potential for integration/Future work
¹³	Proposed an attention-based neural machine translation (NMT) model for subtitle translation	Visual cues (scene changes, speaker identification via facial analysis)	Rule-based grammatical error correction (GEC) post-processing	High translation fluency; robust to some visual noise	GEC struggles with complex syntactic errors; limited contextual understanding from visual data	Integrate deep learning-based GEC; explore audio-visual fusion
¹⁴	Developed a multi-encoder architecture combining text, audio, and visual features	Audio (speaker prosody, emotion); visual (gestures, lip movements)	Encoder-decoder attention mechanism for syntactic error detection and correction	Captures subtle nuances from multimodal inputs; improved accuracy for idiomatic expressions	Computational intensity; multimodal data synchronization challenges; correction mechanism less effective on domain-specific jargon	Optimize model for real-time processing; investigate cross-modal attention for error correction
¹⁵	Introduced a reinforcement learning framework for subtitle translation with a focus on temporal alignment	Temporal synchronization between subtitles and speech	Pre-trained language model (e.g., BERT) fine-tuned for syntactic parsing and correction	Excellent temporal coherence; adaptable to varying speech speeds	Generalization issues with novel syntactic structures; relies heavily on pre-trained model’s capabilities	Develop a more robust, context-aware syntactic error correction module; explore transfer learning from diverse linguistic tasks
¹⁶	Explored a transformer-based model with a dedicated syntax tree embedding layer	Implicit multimodal integration through rich textual representations (e.g., scene descriptions in text)	Graph neural network (GNN) for dependency parsing and error correction	Highly accurate syntactic error correction; good for languages with complex syntax	Limited direct integration of non-textual modalities; potential over-reliance on explicit textual cues for context	Incorporate explicit visual and audio features; investigate end-to-end multimodal syntactic parsing
¹⁷	Designed a cascaded system: Initial NMT, followed by a multimodal context integration module, then a syntax correction module	Audio (speaker voice, background sounds); visual (on-screen text, object recognition)	Sequence-to-sequence model trained specifically on parallel corpora with syntactic errors	Modular design allows for independent improvement of components; good interpretability	Propagation of errors between modules; potentially less efficient than end-to-end systems	Explore joint training of modules; investigate active learning for specialized error corpora creation
Proposed work	A novel multimodal subtitle translation system integrating GRNN-based syntactic error correction	Video and audio modalities for comprehensive content understanding	GRNN-based mechanism detects and corrects syntactic errors with 95.1% accuracy	High subtitle accuracy, multimodal approach, and flexibility for various language pairs and tasks	Complexity, language specificity, and limited generalizability to specific domains or applications	Integration with other NLP tasks, expansion to other languages, and further improvement of syntax error correction

Subtitle translation models utilize features from both video and audio modalities to generate accurate and contextually relevant subtitles. Visual features, such as moving images and iconic references, provide context and help translators make informed decisions. Moving images convey meaning and context, allowing subtitlers to better understand the scene and provide more accurate translations, while iconic references like images and gestures offer additional context. Audio features, including speech recognition, sound effects, and music, also play a crucial role. AI-powered speech recognition systems can accurately transcribe speech in various languages, accents, and dialects, while sound effects and music enhance the viewing experience and provide additional context. The audio modality helps translators capture the nuances of spoken language, including tone, context, and humor. These features are integrated into the model through multimodal analysis, which considers multiple semiotic modes, and machine learning and AI technologies, which analyze and integrate visual and audio features to improve translation accuracy and efficiency.

Language processing, speech processing, and visual processing have largely developed as separate fields. Language, however, comprises all forms of human communication, whether spoken, written, or signed. Because written text is the predominant representation, natural language processing (NLP) has focused mainly on textual data and has often overlooked other facets of communication, such as non-verbal auditory signals, facial expressions, and hand gestures. Recent advances in multimodal machine learning have given rise to many multimodal NLP tasks, a welcome development that brings these components of language together. Such tasks involve several modalities in one of two ways: (i) by using one modality to support the understanding of language in another modality, or (ii) by converting between modalities. The first category includes notable extensions of tasks that were originally unimodal. Until recently, the visual modality received less attention in translation than audio. With a diverse array of multimodal task formulations now available, multimodal machine translation research that uses visual (or audio-visual) input is gaining importance equal to that of research addressing audio.

Proposed algorithm

Principle of GRNN

GRNN is a neural network with a four-layer structure (input, pattern, summation, and output layers).¹⁸ Compared with an ordinary neural network, GRNN adds a summation layer; its main structure is shown in Figure 1. The use of neural networks is generally divided into two processes: model training and data prediction. In the model training phase, the training data are first used to assign values to the input and output layers, and the trained GRNN model is then obtained by calculating the parameters of the network. In the data prediction stage, the given data are fed to the trained GRNN model and the prediction results of the output layer are calculated.^19,20

Figure 1.

GRNN structure.

The GRNN network structure is described mathematically below. First, the input matrix $X$ and the output matrix $Y$ of the GRNN are defined as

$$X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix}, \quad Y = \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1n} \\ y_{21} & y_{22} & \dots & y_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ y_{k1} & y_{k2} & \dots & y_{kn} \end{bmatrix} \tag{1}$$

where $x_{ij}$ denotes input attribute $i$ of sample $j$, $y_{ij}$ denotes output $i$ of sample $j$, $m$ denotes the total number of input attributes, $n$ denotes the total number of samples, and $k$ denotes the dimension of the output.

After the input layer, a Gaussian activation function processes the input signal to produce the outputs of the pattern layer:

$$α_{j} = \exp\left[-\frac{(X' - X_{j})^{T} (X' - X_{j})}{2σ^{2}}\right], \quad j = 1, 2, \dots, n \tag{2}$$

where $σ > 0$ denotes the smoothing factor, $α_{j}$ denotes the output of pattern-layer neuron $j$, $X'$ is an arbitrary input sample, and $X_{j}$ is training sample $j$.

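At the summation layer, the outputs of all pattern-layer neurons are weighted and summed.^21–24

$$S_{N_{i}} = \sum_{j=1}^{n} w_{ij} α_{j}, \quad i = 1, 2, \dots, k \tag{3}$$

where $w_{ij}$ is the weight connecting pattern-layer neuron $j$ to summation neuron $i$; in the standard GRNN formulation it equals the training output $y_{ij}$. Let $S_{D} = \sum_{j=1}^{n} α_{j}$ denote the unweighted sum of the pattern-layer outputs. The output layer then computes

$$\hat{y}_{i} = \frac{S_{N_{i}}}{S_{D}}, \quad i = 1, 2, \dots, k \tag{4}$$

where $\hat{y}_{i}$ denotes the estimate produced by output-layer neuron $i$.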

The smoothing factor $σ$ and the kernel function centers have the greatest impact on the training of the GRNN algorithm. Therefore, a center offset factor $λ$ needs to be introduced to regulate the kernel function center position.

$$α_{j} = \exp\left[-\frac{(X' - λ X_{j})^{T} (X' - λ X_{j})}{2σ^{2}}\right], \quad j = 1, 2, \dots, n \tag{5}$$
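The forward pass of Eqs. (2) to (5) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `grnn_predict` is hypothetical, samples are stored one per row (the transpose of the layout in Eq. (1)), and the summation weights are taken to be the training outputs, as in the standard GRNN formulation. Only the Python standard library is used.

```python
import math

def grnn_predict(x_new, train_inputs, train_outputs, sigma, lam=1.0):
    """Estimate the outputs for one input sample with a GRNN.

    train_inputs  : n training samples, each a list of m attributes
    train_outputs : n training targets, each a list of k values
    sigma         : smoothing factor (must be > 0)
    lam           : centre offset factor; lam = 1.0 recovers Eq. (2)
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    # Pattern layer: Gaussian kernel between x_new and each shifted centre (Eq. 5).
    alphas = []
    for x_j in train_inputs:
        sq_dist = sum((a - lam * b) ** 2 for a, b in zip(x_new, x_j))
        alphas.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))

    # Summation layer: denominator S_D and one numerator S_N_i per output (Eq. 3).
    s_d = sum(alphas)
    if s_d == 0.0:
        raise ZeroDivisionError("all kernel activations underflowed; increase sigma")
    k = len(train_outputs[0])
    s_n = [sum(y_j[i] * a for y_j, a in zip(train_outputs, alphas))
           for i in range(k)]

    # Output layer: normalised weighted average of the training outputs (Eq. 4).
    return [s / s_d for s in s_n]
```

A small $σ$ makes the estimate follow the nearest training sample, while a large $σ$ pulls it toward the mean of all training outputs, which is why $σ$ and $λ$ are the natural tuning targets.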

GRNN for subtitle translation algorithm integrated with syntax error correction mechanism

This approach is distinctive in that it uses the Generalized Regression Neural Network (GRNN) to model intricate relationships between input and output variables, and integrates subtitle translation with a syntactic error correction mechanism (Figure 2). Here, the GRNN translates subtitles from one language to another and corrects grammatical errors during translation. The GRNN-based system examines grammar, syntax, and semantics to provide high-quality, syntactically error-free translations. Because error correction is built into the pipeline, the system can detect and fix errors in real time, which improves subtitle correctness and fluency. Combining the strengths of GRNN with syntactic error correction could make it possible to produce high-quality, error-free subtitles for video material made for the internet, television, and film.

Figure 2.

GRNN approach for subtitle translation algorithm integrated with syntax error correction mechanism.

The syntactic error correction mechanism in this study is implemented using a Generalized Regression Neural Network (GRNN) that is learned from annotated data. It includes the following steps:

1. Training: The GRNN model is trained on a dataset of annotated subtitles, where the annotations include syntactic errors and their corresponding corrections.

2. Pattern recognition: During training, the GRNN model learns to recognize patterns and relationships between the input subtitles and their corresponding syntactic errors.

3. Error detection and correction: When a new subtitle is input, the trained GRNN model detects syntactic errors and corrects them based on the patterns and relationships learned during training.

The GRNN-based syntactic error correction mechanism is not rule-based, but rather a machine learning approach that relies on the model’s ability to learn from data. The results show that this approach is effective in correcting syntactic errors, with a syntax error correction accuracy of 95.1% and a significant reduction in syntax error rate.

During preprocessing, the system removes unnecessary characters, punctuation, and formatting from the input subtitles. The preprocessed subtitles are then tokenized into words or phrases. In the next step, GRNN model training, the model is trained on a large set of textual subtitles and learns to use context and linguistic features to predict subtitle translations. The trained GRNN model then translates subtitles into the target language. After translation, a syntax error detection step checks the subtitles for syntactic errors, and a syntax error correction step fixes the errors found. The corrected subtitles are then formatted and cleaned to provide a good viewing experience. Finally, an evaluation phase assesses the method in terms of fluency, translation accuracy, and syntactic error correction accuracy, and the evaluation results are used to refine the algorithm.
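The cleaning and tokenization steps at the start of this pipeline might look like the sketch below. It is an illustration under stated assumptions: the source does not specify the subtitle markup, so SRT/ASS-style tags such as `<i>` and `{\an8}` are assumed, and the helper names `clean_subtitle`, `tokenize`, and `preprocess` are hypothetical.

```python
import re

# Assumed markup: HTML-like tags (<i>, </b>) and brace codes ({\an8}).
TAG_RE = re.compile(r"<[^>]+>|\{[^}]*\}")
# A word, optionally followed by an apostrophe suffix (e.g. "don't").
TOKEN_RE = re.compile(r"\w+(?:'\w+)?")

def clean_subtitle(line):
    """Remove formatting tags and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", line)
    return " ".join(text.split())

def tokenize(text):
    """Lower-case the text and split it into word tokens, dropping punctuation."""
    return TOKEN_RE.findall(text.lower())

def preprocess(line):
    """Cleaning followed by tokenization, as in the first two pipeline steps."""
    return tokenize(clean_subtitle(line))
```

For example, `preprocess("<i>Don't go!</i>")` yields the tokens `don't` and `go`, which would then be passed to the embedding stage.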

The encoded sequences of vectors from text, video, and audio inputs are fused together to form a single input matrix X. The features from different modalities are aligned and synchronized to ensure that the GRNN can effectively process the multimodal input.

Input features

The GRNN processes a fused input matrix $X \in R^{N \times D_{X}}$ , where $N$ is the synchronized sequence length and $D_{X}$ is the combined dimensionality of the multimodal features. This composite input is meticulously constructed from various modalities, each undergoing a specific encoding process:

Tokenized text features $(T^{'})$

Raw text, such as dialogue or on-screen text, is first tokenized into individual words or subword units. Each token is then mapped to a dense, continuous vector representation, typically through pre-trained word embeddings (e.g., Word2Vec, GloVe, or contextual embeddings like those from BERT or GPT models). These embeddings capture semantic and syntactic relationships between words. The sequence of these embeddings forms the textual input stream.

Audio embeddings $(A^{'})$

Audio input, encompassing speech, sound effects, and background noise, is processed to extract meaningful numerical representations.

Video features $(V^{'})$

Visual information from video frames provides vital contextual cues. Video features are typically extracted using Convolutional Neural Networks (CNNs) and can be complemented by object detection features and motion vectors.
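One simple way to build the fused matrix $X$ from the three streams is to align audio and video to the token sequence and concatenate the feature vectors per time step. The source does not state the alignment or fusion rule, so both choices here (nearest-index resampling and plain concatenation) are assumptions, and the names `resample` and `fuse_modalities` are hypothetical.

```python
def resample(seq, n):
    """Pick n feature vectors from seq by nearest-index sampling."""
    if not seq:
        raise ValueError("empty feature sequence")
    if n == 1:
        return [seq[0]]
    last = len(seq) - 1
    return [seq[round(i * last / (n - 1))] for i in range(n)]

def fuse_modalities(text, audio, video):
    """Return an N x D_X matrix, where N = len(text) and
    D_X is the sum of the text, audio, and video feature sizes."""
    n = len(text)
    audio_aligned = resample(audio, n)
    video_aligned = resample(video, n)
    # Concatenate the three feature vectors of each synchronized step.
    return [t + a + v for t, a, v in zip(text, audio_aligned, video_aligned)]
```

Dropping the `audio` or `video` argument from the concatenation gives the text-only and text + audio inputs used in modality ablations.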

The pattern layer and activation functions in translation tasks

Within the GRNN architecture, the “pattern layer” typically refers to the hidden state layer, where the network learns to identify and encode complex temporal patterns and dependencies from the sequential input data. At each time step $t$ , the GRNN updates its hidden state $h_{t}$ based on the current input $x_{t}$ and the previous hidden state $h_{t - 1}$ . This update mechanism is governed by gating units (e.g., input gate, forget gate, output gate in LSTMs; update gate, reset gate in GRUs) which use sigmoid activation functions to produce values between $0$ and $1$ , effectively acting as “switches” that control the flow of information.

The core transformation within the pattern layer, which generates the candidate hidden state or the new information to be added to the cell state (in LSTMs), typically employs non-linear activation functions, most commonly the hyperbolic tangent (tanh) function. The pattern layer, driven by the strategic application of tanh (and sigmoid for gating), enables the GRNN to build a rich, context-aware internal representation from the multimodal input. This representation is critical for the subsequent decoding process, allowing the model to generate accurate and fluent translations that align with the nuanced meaning conveyed across text, audio, and visual modalities.

Differential evolution algorithm for GRNN (DE-GRNN) optimization

As mentioned in the previous section, the main factors affecting the performance of the GRNN algorithm are the smoothing factor $σ$ and the kernel function center offset factor $λ$. Therefore, when using GRNN for subtitle translation, the Differential Evolution (DE) algorithm^24–27 is used to optimize $σ$ and $λ$. A mathematical description of the DE algorithm is given below.

Let $N$ be the population size, $D$ the attribute dimension, and $CR$ the crossover rate. If each attribute takes values in the range $[U_{\min}, U_{\max}]$, attribute $j$ of individual $i$ is initialized as follows:

$$x_{ij} = U_{\min} + \mathrm{rand} \times (U_{\max} - U_{\min}) \tag{6}$$

where $i = 1, 2, \dots, N$, $j = 1, 2, \dots, D$, and $\mathrm{rand}$ is a random number in the range (0, 1).

In order to obtain a new individual for the next generation, a mutation operation needs to be performed on the individual $x_{i}^{G}$ ( $i = 1, 2, \dots, N$ ) of the Gth generation.

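$$v_{i}^{G+1} = x_{r_{1}}^{G} + F \times (x_{r_{2}}^{G} - x_{r_{3}}^{G}) \tag{7}$$

where $r_{1}$, $r_{2}$, and $r_{3}$ are the indices of three randomly selected individuals (in standard DE, mutually distinct and different from $i$) and $F$ is the differential scaling factor.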

The crossover operation is defined as follows:

$$u_{ij}^{G+1} = \begin{cases} v_{ij}^{G+1}, & \mathrm{rand}(0, 1) \leq CR \\ x_{ij}^{G}, & \text{otherwise} \end{cases} \tag{8}$$

The fitness values of $x_{i}^{G}$ and $u_{i}^{G+1}$ are then compared, and the individual with the higher fitness is carried into the next generation:

$$x_{i}^{G+1} = \begin{cases} u_{i}^{G+1}, & f(u_{i}^{G+1}) > f(x_{i}^{G}) \\ x_{i}^{G}, & f(u_{i}^{G+1}) \leq f(x_{i}^{G}) \end{cases} \tag{9}$$

where $f$ represents the fitness function. The DE algorithm terminates when the maximum number of generations $G_{\max}$ is reached.

The scaling factor $F$ generally takes values in $[0, 2]$. Because the magnitude of $F$ directly affects the optimization performance of the DE algorithm, an adaptive $F$ strategy is introduced:

$$F = F_{\min} + (F_{\max} - F_{\min}) \times e^{1 - \frac{G_{\max}}{G_{\max} - G + 1}} \tag{10}$$

where $F_{\min}$ and $F_{\max}$ are the lower and upper limits of $F$, both chosen within $[0, 2]$, and $G$ is the current generation. Under Eq. (10), $F$ equals $F_{\max}$ at $G = 1$ and decays toward $F_{\min}$ as $G$ approaches $G_{\max}$.

The flow of the DE-GRNN model is shown in Figure 3.

Figure 3.

Flow of the DE-GRNN model.
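The DE loop of Eqs. (6) to (10) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, default settings, and the clipping of mutants to the search box are assumptions. For DE-GRNN, each individual would be a pair ($σ$, $λ$) and `fitness` would score a GRNN built with those values on validation data; here the fitness is left as a user-supplied callable.

```python
import math
import random

def adaptive_f(g, g_max, f_min, f_max):
    """Eq. (10): F starts at f_max for g = 1 and decays toward f_min."""
    return f_min + (f_max - f_min) * math.exp(1.0 - g_max / (g_max - g + 1.0))

def differential_evolution(fitness, bounds, pop_size=20, cr=0.5,
                           f_min=0.2, f_max=0.9, g_max=100, seed=0):
    """Maximise `fitness` over the box `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Eq. (6): uniform random initialisation inside the bounds.
    pop = [[lo + rng.random() * (hi - lo) for lo, hi in bounds]
           for _ in range(pop_size)]
    fit = [fitness(ind) for ind in pop]

    for g in range(1, g_max + 1):
        f = adaptive_f(g, g_max, f_min, f_max)
        new_pop, new_fit = list(pop), list(fit)
        for i in range(pop_size):
            # Eq. (7): mutation from three distinct individuals other than i.
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.sample(others, 3)
            mutant = [pop[r1][d] + f * (pop[r2][d] - pop[r3][d])
                      for d in range(dim)]
            # Assumption: clip the mutant so it stays inside the search box.
            mutant = [min(max(v, lo), hi) for v, (lo, hi) in zip(mutant, bounds)]
            # Eq. (8): binomial crossover between mutant and parent.
            trial = [mutant[d] if rng.random() <= cr else pop[i][d]
                     for d in range(dim)]
            # Eq. (9): greedy selection keeps the fitter of trial and parent.
            trial_fit = fitness(trial)
            if trial_fit > fit[i]:
                new_pop[i], new_fit[i] = trial, trial_fit
        pop, fit = new_pop, new_fit

    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

The large early $F$ favours exploration of the ($σ$, $λ$) space, while the small late $F$ narrows the search around the best individuals found so far.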

A fitness function, often termed an objective function or loss function in machine learning, quantifies the performance of a model or algorithm. In the context of “Multimodal Subtitle Translation Algorithms Integrated with Syntax Error Correction Mechanisms,” the fitness function is crucial for guiding the learning process, evaluating the quality of translations, and ultimately determining the optimal parameters of the model. Its design dictates what aspects of translation quality the algorithm prioritizes and optimizes for.

The task can be treated as a multi-objective optimization problem, in which the algorithm simultaneously optimizes multiple, potentially conflicting objectives (e.g., maximizing BLEU while minimizing syntax errors) without necessarily combining them into a single scalar value at the outset. This often results in a set of Pareto-optimal solutions.

Alternatively, the objectives can be handled in stages. A primary translation stage optimizes mainly for translation quality (e.g., maximizing the BLEU score) to obtain a semantically close translation. A refinement/correction stage then uses the syntax error rate as a secondary objective to improve grammatical correctness, possibly through reinforcement learning or a separate post-processing step.

In deep learning contexts, the loss function could be designed to include terms that directly penalize grammatical errors alongside traditional cross-entropy loss for translation. For example, a “syntax loss” component could be derived from the output of a grammatical error detection sub-module.
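To make the multi-objective view concrete, the sketch below filters candidate parameter settings down to the Pareto-optimal ones for the two objectives named above, and also shows one possible scalar fitness. Both are illustrations only: the function names, the weight value, and the linear combination are assumptions, not the fitness function used in the reported experiments.

```python
def dominates(a, b):
    """True if candidate a is no worse than b on both objectives and strictly
    better on at least one. A candidate is (bleu, syntax_error_rate)."""
    bleu_a, err_a = a
    bleu_b, err_b = b
    no_worse = bleu_a >= bleu_b and err_a <= err_b
    strictly_better = bleu_a > bleu_b or err_a < err_b
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

def scalar_fitness(bleu, syntax_error_rate, weight=0.5):
    """Hypothetical single-value fitness: reward BLEU, penalise syntax errors."""
    return bleu - weight * syntax_error_rate
```

A scalar fitness of this kind can be plugged directly into a single-objective optimizer such as DE, whereas the Pareto filter leaves the final trade-off between BLEU and syntax errors to the practitioner.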

DE-GRNN-based subtitle translation algorithm integrated with syntax error correction mechanism

The DE-GRNN-based subtitle translation algorithm with integrated syntax error correction is designed to produce accurate and fluent subtitle translations. The DE-GRNN-based translation mechanism is shown in Figure 4. It combines DE optimization with the GRNN model: DE tunes the GRNN parameters, which helps the model capture subtle patterns and correlations in the data and improves translation accuracy. The system includes a syntax error correction mechanism that detects and fixes syntax errors in real time, so that translated subtitles are free of grammatical errors. Because the algorithm analyzes audio, video, and text together, it can exploit contextual information and subtleties, resulting in more accurate and natural translations. Taken together, these properties improve the accuracy, fluency, and robustness of subtitle translation, and the approach is applicable to films, TV programs, and online video content.

Figure 4.

DE-GRNN-based translation mechanism.

In the preprocessing step, the input subtitles are cleaned by removing unnecessary characters, punctuation, and formatting, and are then tokenized into words and phrases. The DE optimization step tunes the GRNN model parameters via differential evolution, a population-based method that uses mutation, crossover, and selection to search for the best solution. In the GRNN model training step, the tuned GRNN model is trained on a large dataset of labeled subtitles, using the context and linguistic properties of each subtitle to predict its translation. In the subtitle translation step, the trained model translates input subtitles into the target language. The translated subtitles then pass through syntax error detection, which locates syntax errors, and syntax error correction, which fixes them. Post-processing formats and cleans the corrected subtitles for use. In the final evaluation step, performance is assessed by measuring translation accuracy, syntax error correction accuracy, and fluency, and the results are used to improve the algorithm.

DE-GRNN has several benefits over conventional subtitle translation, including improved syntax error correction, higher translation accuracy, and greater robustness to noisy input data.

Ablation study: Evaluating the impact of multimodality on DE-GRNN performance

To assess the contribution of each modality to the performance of the DE-GRNN model, we conducted an ablation study by evaluating the model under different modality configurations. The study aimed to measure the impact of text-only, text + audio, and text + audio + video configurations on BLEU score, translation accuracy, and fluency.

The ablation study was performed using the DE-GRNN model trained on a large-scale video-subtitle translation dataset. We evaluated the model’s performance under the following modality configurations:

1. Text-only: The model was trained using only text features.

2. Text + audio: The model was trained using text and audio features.

3. Text + audio + video: The model was trained using text, audio, and video features.

The performance of the DE-GRNN model was evaluated using the following metrics:

BLEU score

Measures the similarity between the translated text and the reference text. It is calculated as $\mathrm{BLEU} = \frac{\sum \min(\mathrm{Count}_{clip}, \mathrm{Reference}_{Count})}{\sum \mathrm{Count}} \times e^{\left(1 - \frac{L_{ref}}{L_{trans}}\right)}$, where $\mathrm{Count}_{clip}$ is the number of n-grams in the translated text that match the reference text, $\mathrm{Reference}_{Count}$ is the number of n-grams in the reference text, $\mathrm{Count}$ is the total number of n-grams in the translated text, $L_{ref}$ is the length of the reference text, and $L_{trans}$ is the length of the translated text. The exponential term is the brevity penalty; in the standard BLEU definition it is applied only when $L_{trans} < L_{ref}$ and is set to 1 otherwise.
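The formula above can be illustrated for a single n-gram order. The sketch below is a simplified stand-in rather than the evaluation script behind the reported scores: the name `simple_bleu` is hypothetical, only one reference and one n-gram order are handled, and the brevity penalty follows the standard rule of penalising only translations shorter than the reference. Full BLEU additionally combines orders 1 to 4 with a geometric mean.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, n=1):
    """Clipped n-gram precision times the brevity penalty."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    precision = clipped / total
    # Brevity penalty: 1 unless the candidate is shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(candidate))
    return precision * penalty
```

Clipping prevents a translation from scoring well by repeating a common word: for the candidate "the the the the" against the reference "the cat is on the mat", only two of the four unigrams are credited.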

Translation accuracy

Translation accuracy measures the proportion of correctly translated words or phrases.

$$\text{Translation accuracy} = \frac{\text{Number of correctly translated words}}{\text{Total number of words}} \times 100$$

Fluency

Fluency measures the naturalness and readability of the translated text. It can be calculated using various metrics, such as perplexity or human evaluation. One common approach is to use a language model to calculate the perplexity of the translated text. Perplexity measures how well the language model predicts the translated text.

$$\text{Fluency} = \frac{1}{\text{Perplexity}}$$

$$\text{Perplexity} = 2^{-\frac{1}{N} \sum \log_{2} p(w)}$$

where $N$ is the number of words in the translated text and $p(w)$ is the probability of each word in the translated text given its context.
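The two formulas translate directly into code once a language model has supplied a probability for each word. In this sketch the per-word probabilities are simply passed in as a list; which language model produces them is left open, and the function names are hypothetical.

```python
import math

def perplexity(word_probs):
    """Perplexity of a text given the model probability of each of its words."""
    if not word_probs:
        raise ValueError("at least one word probability is required")
    if any(p <= 0.0 or p > 1.0 for p in word_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    avg_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** (-avg_log2)

def fluency(word_probs):
    """Inverse perplexity: higher values mean the text is easier to predict."""
    return 1.0 / perplexity(word_probs)
```

If every word receives probability 0.25, the perplexity is 4 and the fluency is 0.25, which matches the intuition that the model is choosing among four equally likely words at each step.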

Experimental results and analysis

We use the BigVideo dataset,²⁸ which consists of 150 thousand unique videos (9981 hours in total) with both English and Chinese subtitles. A state-of-the-art machine translation system (Google Translate) was used as the baseline approach.²⁹ As part of preprocessing, tokenization, stopword removal, and stemming were applied to the subtitles. The dataset was split into training (80%), validation (10%), and test (10%) sets. The experiments were run on an Intel Core i7 CPU with 16 GB RAM and an NVIDIA GeForce GTX 1080 Ti GPU, using Python 3.8, TensorFlow 2.3, and Keras 2.4. The hyperparameters for the simulation setting are shown in Table 2.

Table 2.

Hyperparameters.

Hyperparameter	GRNN	DE-GRNN
Number of hidden layers	2	2
Number of units in each hidden layer	128	128
Activation function	ReLU	ReLU
Dropout rate	0.2	0.2
Batch size	32	32
Epochs	50	50
Population size	—	50
Mutation rate	—	0.1
Crossover rate	—	0.5
Loss function	Categorical loss	Cross-entropy
Optimizer (learning rate)	Adam (0.001)
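The 80%/10%/10% split of the dataset can be reproduced with a seeded shuffle, as sketched below. The source does not say how the split was drawn, so the shuffling, the seed, and the function name `split_dataset` are assumptions.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    if not (0.0 < train < 1.0 and 0.0 <= val < 1.0 and train + val < 1.0):
        raise ValueError("train and val must leave a non-empty test share")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train_set, val_set, test_set
```

Splitting by video identifier rather than by individual subtitle line is a common precaution, since it keeps lines from the same video out of both the training and the test sets.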

The BigVideo dataset is a large-scale video-subtitle translation dataset designed for multimodal machine translation. While specific details on video genre diversity and subtitle complexity are limited, it is known to include over 206,000 English–Chinese parallel translation pairs and to be diverse in terms of video content and natural language descriptions. The average subtitle length is 7.34 English words. BigVideo is reported to have one million video-subtitle translation pairs, making it a substantial dataset for multimodal machine translation research. Compared with other datasets, BigVideo offers a large-scale resource for video-subtitle translation tasks, potentially supporting research across various video genres and language pairs.

To prepare the data for the GRNN model, several preprocessing steps were applied. For text preprocessing, tokenization was performed using the NLTK library’s word tokenizer, followed by stopword removal using the NLTK library’s stopwords corpus, and stemming using the Porter Stemmer algorithm. For audio preprocessing, features were extracted using the Librosa library, and noise reduction techniques were applied. For visual preprocessing, objects were detected in video frames using the YOLO algorithm implemented in OpenCV, and features were extracted from the detected objects. The preprocessed text, audio, and visual features were then fused together to form a single input matrix X for the GRNN model, with feature alignment and synchronization ensuring effective processing of the multimodal input. The system was implemented using NLTK, Librosa, OpenCV, and TensorFlow, and evaluated using translation accuracy measured by the BLEU score and syntax error correction accuracy. By leveraging these preprocessing steps and tools, the proposed system was able to effectively integrate multimodal inputs and achieve high accuracy in subtitle translation and syntax error correction.
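The alignment and fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each subtitle has a start and end time, that frame-level audio and video features are mean-pooled over that window, and that the three feature vectors are concatenated into one row of X. The fusion operator is not specified in the source, and the names `mean_pool` and `fuse` are hypothetical.

```python
def mean_pool(frames, start, end, dim):
    """Average the feature vectors of frames whose timestamp lies in [start, end)."""
    selected = [vec for t, vec in frames if start <= t < end]
    if not selected:
        return [0.0] * dim  # no frame in the window: fall back to zeros
    return [sum(col) / len(selected) for col in zip(*selected)]

def fuse(subtitles, audio_frames, video_frames, audio_dim, video_dim):
    """Build one row of X per subtitle: text features + pooled audio + pooled video."""
    rows = []
    for start, end, text_vec in subtitles:
        a = mean_pool(audio_frames, start, end, audio_dim)
        v = mean_pool(video_frames, start, end, video_dim)
        rows.append(list(text_vec) + a + v)
    return rows

# Toy data: (start_s, end_s, text_features) per subtitle,
# and (timestamp_s, features) per audio or video frame.
subtitles = [(0.0, 2.0, [1.0, 0.0]), (2.0, 4.0, [0.0, 1.0])]
audio_frames = [(0.5, [0.2]), (1.5, [0.4]), (2.5, [0.6])]
video_frames = [(1.0, [1.0, 3.0]), (3.0, [2.0, 4.0])]

X = fuse(subtitles, audio_frames, video_frames, audio_dim=1, video_dim=2)
print(len(X), len(X[0]))
```

Each row of `X` has the same width regardless of how many frames fall inside the subtitle window, which is what allows the fused features to be passed to a single model.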

The GRNN-based method outperforms the baseline approach in translation accuracy, syntax error correction, and fluency (Table 3). Translation accuracy rises by 10.4 percentage points (from 82.1% to 92.5%), a clear improvement in the quality of the translated subtitles. The GRNN-based technique’s syntax error correction accuracy of 95.1% is 35.3% higher than that of the baseline approach. Its fluency score of 4.8 indicates that the translated subtitles are highly readable and fluent. The GRNN-based subtitle translation algorithm coupled with the syntax error correction mechanism thus obtains state-of-the-art results in translation accuracy, syntax error correction, and fluency. These results show both the effectiveness of GRNN for subtitle translation and the need to include syntax error correction to raise the overall quality of the translated subtitles.

Table 3.

Simulation results for baseline approach and GRNN-based approach.

Algorithm	BLEU score	Translation accuracy (%)	Syntax error correction accuracy (%)	Syntax error rate (%)	Fluency score (out of 5)
Baseline approach	0.65	82.1	70.2	29.8	4.0
GRNN-based approach	0.85	92.5	95.1	4.9	4.8
DE-GRNN-based approach	0.92	95.2	98.5	1.5	4.9
Improvement (GRNN over baseline)	30.8%	10.4%	35.3%	−83.5%	20%
Improvement (DE-GRNN over GRNN)	8.2%	2.9%	3.6%	−69.4%	2.1%
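The improvement rows of Table 3 can be checked with a few lines of arithmetic. The sketch below assumes that "improvement" means relative change, (new − old) / old; the function name `relative_change` is ours. Under that assumption the DE-GRNN over GRNN row is reproduced after rounding, and the GRNN over baseline row agrees to within a few tenths except for translation accuracy, where the tabulated 10.4% matches the absolute difference in percentage points (the relative change is about 12.7%).

```python
def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Columns of Table 3: BLEU, translation accuracy, correction accuracy,
# syntax error rate, fluency score.
baseline = [0.65, 82.1, 70.2, 29.8, 4.0]
grnn = [0.85, 92.5, 95.1, 4.9, 4.8]
de_grnn = [0.92, 95.2, 98.5, 1.5, 4.9]

grnn_vs_baseline = [round(relative_change(b, g), 1) for b, g in zip(baseline, grnn)]
de_vs_grnn = [round(relative_change(g, d), 1) for g, d in zip(grnn, de_grnn)]

# Absolute gain in translation accuracy, in percentage points.
accuracy_gain_points = round(grnn[1] - baseline[1], 1)

print(grnn_vs_baseline)
print(de_vs_grnn)
print(accuracy_gain_points)
```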

The DE-GRNN-based technique in turn outperforms the GRNN-based approach in translation accuracy, syntax error correction, and fluency. Translation accuracy improves by 2.9%, from 92.5% to 95.2%. At 98.5%, the syntax error correction accuracy of the DE-GRNN-based technique is 3.6% higher than that of the GRNN-based approach. Its fluency score of 4.9 shows that the translated subtitles are highly fluent and readable. Integrated with the syntax error correction mechanism, the DE-GRNN-based subtitle translation algorithm produces state-of-the-art results in translation accuracy, syntax error correction, and fluency. These results show how well DE can be used to optimize the GRNN model, and they confirm the need to include syntax error correction to raise the overall quality of the translated subtitles.

In every metric (BLEU score, translation accuracy, syntax error correction accuracy, and fluency score), the DE-GRNN-based method outperforms the GRNN-based approach (Table 4). The 8.2% improvement in BLEU score indicates a considerable gain in the quality of the translated subtitles. The 2.9% rise in translation accuracy shows that the translated subtitles are more correct. Syntax error correction accuracy rose by 3.6%, indicating that the corrected subtitles contain fewer syntactic errors. The 2.1% rise in fluency score shows that the translated subtitles read more naturally. Because the DE-GRNN-based subtitle translation method combined with the syntax error correction mechanism surpasses the GRNN-based approach in all measures, DE is shown to be advantageous for optimizing the GRNN model. Given these results, the DE-GRNN-based method is a strong candidate for subtitle translation projects, as it produces subtitles that are more accurate, fluent, and understandable.

Table 4.

Simulation results for GRNN-based approach and DE-GRNN-based approach.

Metric	GRNN	DE-GRNN	p-value
BLEU score	0.85 ± 0.05	0.92 ± 0.03	0.01
Translation accuracy (%)	92.5 ± 2.1	95.2 ± 1.5	0.02
Syntax error correction accuracy (%)	95.1 ± 1.2	98.5 ± 0.8	0.001
Fluency score (out of 5)	4.8 ± 0.2	4.9 ± 0.1	0.03

Optimized performance of DE

To assess how effectively DE optimizes GRNN, the mean of the normalized feature differences was calculated for 20 English samples with both GRNN and DE-GRNN, as shown in Figure 5.

Figure 5.

Mean difference in characteristics between GRNN and DE-GRNN.

It can be seen that DE-GRNN yields smaller feature difference values than GRNN. The mean value for DE-GRNN is approximately 0.10, while the mean value for the GRNN algorithm is approximately 0.20. In terms of convergence, GRNN completes convergence earlier in the solving process of the sample set.

To further validate the optimization efficacy of the DE method for GRNN, five samples were randomly selected from the experimental dataset to compare the standard deviation and computation time of GRNN and DE-GRNN. The experiment was repeated a total of three times and the results are shown in Table 5.

Table 5.

Standard deviation and computation time of the GRNN and DE-GRNN algorithms.

No.	Algorithms	Mean characteristic difference	Standard deviation	Time consumed/s
1	GRNN	0.2002	0.0047	12.927
1	DE-GRNN	0.1013	0.0032	14.033
2	GRNN	0.2297	0.0081	28.921
2	DE-GRNN	0.0839	0.0066	30.785
3	GRNN	0.1834	0.0129	63.462
3	DE-GRNN	0.0662	0.0092	65.562

The minimum standard deviation of the DE-GRNN algorithm is 0.0032, compared with 0.0047 for GRNN. DE-GRNN requires more computation time in every trial, but the overhead is small: roughly 3% to 9% (about 1 to 2 s) across the three trials in Table 5.

Computational complexity

The DE-GRNN algorithm’s computational complexity is primarily driven by the Differential Evolution (DE) optimization process and the Generalized Regression Neural Network (GRNN) architecture. The cost of the DE optimization depends on the population size ($P$), the number of generations ($G$), and the dimensionality of the problem, resulting in a complexity of $O(P \times G \times \mathrm{Complexity}_{\mathrm{GRNN\ training}})$. The GRNN’s complexity, in turn, depends on sequence length, feature dimensionality, hidden state size, number of layers, and batch size. The combined complexity of DE-GRNN is dominated by the repeated GRNN training and evaluation within the DE optimization loop, which makes it computationally intensive.
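The dependence on population size and number of generations can be made concrete with a small sketch of a classic DE/rand/1/bin loop that counts fitness evaluations. This is an illustration under stated assumptions, not the authors' code: the toy `sphere` objective stands in for "train and evaluate a GRNN with the candidate parameters", the Table 2 mutation rate (0.1) is assumed to play the role of the DE scale factor, and the number of generations (20) is arbitrary.

```python
import random

def sphere(x):
    """Toy objective standing in for one GRNN training/evaluation run (assumption)."""
    return sum(v * v for v in x)

def differential_evolution(fitness, dim, bounds, pop_size=50, generations=20,
                           scale=0.1, crossover=0.5, seed=0):
    """DE/rand/1/bin minimiser; returns (best_vector, best_score, n_evaluations)."""
    rng = random.Random(seed)
    low, high = bounds
    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    n_evals = pop_size
    for _ in range(generations):
        new_pop, new_scores = list(pop), list(scores)
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < crossover:
                    value = pop[a][d] + scale * (pop[b][d] - pop[c][d])
                    trial.append(min(max(value, low), high))  # clip to bounds
                else:
                    trial.append(pop[i][d])
            trial_score = fitness(trial)
            n_evals += 1
            if trial_score <= scores[i]:  # greedy one-to-one selection
                new_pop[i], new_scores[i] = trial, trial_score
        pop, scores = new_pop, new_scores
    best = min(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best], n_evals

best, best_score, n_evals = differential_evolution(sphere, dim=3, bounds=(-5.0, 5.0))
print(n_evals)  # pop_size * (generations + 1) fitness evaluations
```

Every call to `fitness` corresponds to one full model training and evaluation, so the total cost grows linearly in both the population size and the number of generations.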

Scalability

The DE-GRNN framework exhibits promising scalability to other languages and domains, but its practical scalability is contingent on the availability and quality of multimodal training data. For language scalability, the GRNN architecture can adapt to any language with sufficient data, but challenges arise from the need for high-quality parallel multimodal corpora and effective pre-trained embeddings. For domain scalability, the model might generalize well to new domains if trained on a diverse set of subtitles, but domain-specific fine-tuning would be necessary for highly specialized domains.

Dependence on high-quality multimodal inputs

The performance of the DE-GRNN is critically dependent on the quality of its multimodal inputs. Noise in audio or video streams, temporal misalignment, and errors in the source text can all negatively impact translation accuracy. The model relies on precise alignment and synchronization of text, audio, and video features, and the extracted features must capture semantic, syntactic, and contextual information relevant to translation. Furthermore, training a robust GRNN requires a vast amount of diverse, high-quality labeled data. Insufficient or non-diverse data can lead to overfitting or poor generalization. Therefore, meticulous preparation and high-quality multimodal inputs are essential for achieving optimal performance.

The results in Table 6 show that adding audio and video features improves the performance of the DE-GRNN model on every evaluation metric. The text + audio + video configuration achieves the highest BLEU score, translation accuracy, and fluency, indicating that the multimodal approach is effective. Audio features improve the model’s ability to capture contextual information, while video features provide visual context that enhances the model’s understanding of the scene. The DE-GRNN model therefore benefits from the fusion of multiple modalities.

Table 6.

Results w.r.t modality configuration.

Modality configuration	BLEU score	Translation accuracy	Fluency
Text-only	0.65	0.70	0.75
Text + audio	0.72	0.78	0.82
Text + audio + video	0.80	0.85	0.88

Conclusion

This work used GRNN and DE-GRNN to build a subtitle translation system with syntactic error correction. The experimental results demonstrate the effectiveness of the algorithm. The GRNN-based technique yielded a BLEU score of 0.85, a translation accuracy of 92.5%, a syntax error correction accuracy of 95.1%, and a fluency score of 4.8. DE-GRNN surpassed GRNN with a BLEU score of 0.92, a translation accuracy of 95.2%, a syntax error correction accuracy of 98.5%, and a fluency score of 4.9. Compared to GRNN, DE-GRNN improved BLEU by 8.2%, translation accuracy by 2.9%, syntax error correction accuracy by 3.6%, and fluency by 2.1%. A subtitle translation method with syntax error correction based on GRNN and DE-GRNN is therefore effective. Multimedia artists, language learners, and hearing-impaired people can benefit from the more accurate and fluent subtitles that the technology produces. Other deep learning architectures and approaches can be investigated to improve the subtitle translation algorithm further, and the technique can be extended to additional languages and domains.

Future research may enhance the GRNN technique with syntactic error correction in several directions. Parallel processing or distributed computing may speed up translation, and the GRNN model design may be simplified to improve computational efficiency. Expanding language support is also important, whether by creating language-specific models, by adapting existing models to linguistic variation, or by adding further languages, especially those with complex writing systems or insufficient training data. In-depth qualitative evaluation of the system’s performance on specific language pairs or domains may provide useful information, notably about how idioms and colloquialisms affect translation accuracy. A detailed analysis of failure cases might reveal recurring patterns and guide targeted fixes. Human evaluations of the system in practical use would capture user feedback for future improvement. Tighter integration of audio or visual signals may increase translation accuracy for video subtitling or live events. Further study in these areas may improve the performance, efficiency, and practicality of the GRNN method.

Footnotes

ORCID iD

Jishou Mu

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Dang, Gong. Design of intelligent recognition English translation model based on improved GLR algorithm. Comput Meas Control 2020; 28(259): 166–169.

2. Huang. The algorithm design of English translation software translation accuracy correction. Mod Electr Tech 2018; 41(14): 178–180+185.

3. Yuan. Research on Chinese and English cross-language plagiarism recognition technology based on translation features and content. Dissertation. Shanghai: Shanghai Jiaotong University, 2011.

4. Long, Jian. Error correction algorithm for Chinese speech recognition based on phrase translation model. Speech information committee of Chinese information society of China. Proceedings of the 14th national conference on human-machine speech communication (NCMMSC’2017). Phonetic information committee of Chinese information society of China. Tsinghua National Laboratory of Information Science and Technology (Preparatory), 2017, p. 6.

5. Yao. Research on computer intelligent proofreading system based on improved phrase translation model. Electron Des Eng 2020; 28(18): 52–55+59.

6. Qiu. A simultaneous interpretation box using BIRCH clustering algorithm to translate Mandarin and English. 2018.

7. Zhou. Advancing fundamentals and applications of X-ray birefringence imaging. Cardiff: Cardiff University, 2018.

8. Luo, Zhu, et al. Research and implementation of machine translation decoding algorithm based on phrase statistics. Comput Eng Appl 2007; 30: 171–173+178.

9. Liu, Zhao, et al. Research on phrase machine translation algorithm based on pruning strategy. The fifth national machine translation conference. China, 2010, pp. 187–193.

10. Rongwan, Puspani. Shifts in translation of English noun phrases into Indonesian with reference to the short story a scandal in Bohemia. Humanis 2019; 23(3): 167.

11. Zheng, Shi, et al. A method for manuscript translation optimization based on KNN algorithm. J Chromatogr A 2015; 1405: 23–31.

12. Research on model training, adaptation and learning algorithms in statistical machine translation. J Arthroplast 2014; 29: 923–928.

13. Baniata, Kang, Ampomah IKE. A reverse positional encoding multi-head attention-based neural machine translation model for Arabic dialects. Mathematics 2022; 10: 3666.

14. Boes, Van hamme. Multi-encoder attention-based architectures for sound recognition with partial visual assistance. J Audio Speech Music Proc 2022; 2022: 25.

15. Ananthanarayana, Srivastava, Chintha, et al. Deep learning methods for sign language translation. ACM Trans Access Comput 2021; 14(4): 30.

16. Anggrainingsih, Hassan, Datta. Transformer-based models for combating rumours on microblogging platforms: a review. Artif Intell Rev 2024; 57: 212.

17. Sulubacak, Caglayan, Grönroos, et al. Multimodal machine translation through visuals and speech. Mach Translat 2020; 34: 97–147.

18. Chen, Shen, et al. Quantification of interfacial energies associated with membrane fouling in a membrane bioreactor by using BP and GRNN artificial neural networks. J Colloid Interface Sci 2020; 565: 1–10.

19. Bendu, Deepak, Murugan. Multi-objective optimization of ethanol fuelled HCCI engine performance using hybrid GRNN-PSO. Appl Energy 2017; 187: 601–611.

20. Zhu, Lian, Wei, et al. PM2.5 forecasting using SVR with PSOGSA algorithm based on CEEMD, GRNN and GCA considering meteorological factors. Atmos Environ 2018; 183: 20–32.

21. Polat, Yıldırım. Genetic optimization of GRNN for pattern recognition without feature extraction. Expert Syst Appl 2008; 34(4): 2444–2448.

22. Ghritlahre, Prasad. Investigation of thermal performance of unidirectional flow porous bed solar air heater using MLP, GRNN, and RBF models of ANN technique. Therm Sci Eng Prog 2018; 6: 226–235.

23. Izonin, Tkachenko, Verhun, et al. An approach towards missing data management using improved GRNN-SGTM ensemble method. Eng Sci Technol, an Int J 2021; 24(3): 749–759.

24. Das, Suganthan. Differential evolution: a survey of the state-of-the-art. IEEE Trans Evol Comput 2010; 15(1): 4–31.

25. Pant, Zaheer, Garcia-Hernandez. Differential evolution: a review of more than two decades of research. Eng Appl Artif Intell 2020; 90: 103479.

26. Opara, Arabas. Differential evolution: a survey of theoretical analyses. Swarm Evol Comput 2019; 44: 546–558.

27. Deng, Liu, et al. An improved quantum-inspired differential evolution algorithm for deep belief network. IEEE Trans Instrum Meas 2020; 69(10): 7319–7327.

28. Online link: https://github.com/DeepLearnXMU/BigVideo-VMT

29. Online link: https://github.com/vTuanpham/Large_dataset_translator
