Construction of a cross-language text matching model based on SACNN

Abstract

The rapid progress of Internet technology has accelerated the development of natural language processing. To address the poor adaptability and accuracy of current cross-language text matching and translation, a multi-head attention mechanism and a convolutional neural network are first combined to construct a cross-language text matching model based on a similarity-based attention convolutional neural network (SACNN). Then, visual features are added to the Transformer model to build a real-time machine translation model based on the improved Transformer. The results showed that the accuracy of the proposed text matching model reached 83.42% at epoch 4. The proposed model achieved accuracy rates of 78.96%, 77.55%, and 79.86% when matching French, German, and Spanish with English, respectively, and 79.16%, 75.03%, and 76.54% when matching English with these three languages. In addition, as the training data size increased from 1 M to 3 M, the Bilingual Evaluation Understudy (BLEU) score of the proposed translation model improved by 36.45%, demonstrating good scalability. In summary, the model constructed in the study offers high accuracy and adaptability and scales well with the amount of training data.

Keywords

cross-language text matching; machine translation; deep learning; attention mechanism

Introduction

The development of artificial intelligence has driven significant progress in neural machine translation (NMT), facilitating cross-cultural communication and the learning of new languages. Machine translation (MT) is the process of translating text from one language to another, which typically requires complex transformations of language structure, grammar rules, and semantic understanding.^1,2 A text matching (TextM) model compares the similarity or correlation between two pieces of text. The cross-language text matching model (CL-TMM) mainly handles matching and alignment between texts in different languages, which involves understanding the semantics and structure of each language.^3,4 Therefore, CL-TMM can be regarded, to a certain extent, as the foundation of the MT model. However, traditional TextM models lack adaptability in cross-language scenarios and struggle to handle differences in vocabulary, grammar, and culture between languages. In addition, existing models are limited in capturing the deep semantic and structural information of text, so matching accuracy and adaptability still need to be improved.⁵ In this context, this study builds a CL-TMM based on the similarity attention convolutional neural network (SACNN) and a real-time MT model based on the improved Transformer (ITransRT). Compared with previous models, SACNN combines the multi-head attention mechanism with a convolutional neural network (CNN) to capture the global semantics and local features of text more comprehensively, enabling higher accuracy and adaptability when processing texts from different languages and domains. The objective of this research is to propose a novel solution to the challenge of cross-language TextM and translation, enhancing the ability to accurately and adaptively capture the semantic similarity and grammatical structure between texts. The significance of the research lies in improving the accuracy and adaptability of cross-language TextM and translation, better meeting practical needs, and providing new possibilities for real-time MT, so that it performs better when processing multi-modal text.

The innovation of this study lies in: (1) combining the global information extraction capability of multi-head attention mechanism and the local feature extraction capability of convolutional neural network (CNN) to better capture semantic similarity between cross linguistic texts. (2) On the basis of the traditional Transformer model, visual features are added as auxiliary modalities, and the Wait-k strategy is introduced to compensate for the semantic loss caused by insufficient input information at the source end. (3) By utilizing a hierarchical attention mechanism and feature fusion module, the model effectively integrates image and text information, enhancing its ability to understand complex semantics.

The main structure of the research content consists of four parts. The first part is to analyze the current research status to point out the shortcomings of the current research. The second part focuses on the problem of cross-language TextM and translation, building a CL-TMM based on the SACNN and the ITransRT. The third part is to analyze the application effect of the research model. The last part is a summary of the research results, pointing out the shortcomings of the research and the prospects for future research.

Related works

TextM is an important foundational problem in natural language processing (NLP) and underlies many NLP tasks, such as information retrieval, question answering, paraphrase identification, dialogue systems, and MT. Li et al. addressed the lack of semantic concepts in images during cross-modal image-text retrieval by integrating semantic relationship information into visual and text features. Their model learned a common embedding space for aligning images and text descriptions, achieving high efficiency and performance.⁶ Rossi et al. observed that exact pattern matching is often used to support approximate pattern matching, but the r-index cannot effectively support the important queries this requires. They applied prefix-free parsing to build the r-index and computed the associated thresholds in time and space linear in the size of the prefix-free parsing, showing that the method constructs the index quickly.⁷ Iqbal et al. reviewed deep learning models for text generation, focusing on their design, architecture, and application in NLP. They also summarized the various models and gave a detailed account of the past, present, and future of text generation models in deep learning.⁸ Controllable text generation with Transformer-based pre-trained language models (PTLM) is difficult to guarantee because deep neural networks offer limited interpretability. Zhang et al. conducted a systematic and critical review of the main methods, common tasks, and evaluation methods for this problem, discussed the challenges faced in the field, and proposed several promising future directions.⁹ Avrahami et al. proposed an accelerated solution for local text-driven editing of generic images, using a latent text-to-image diffusion model to address the relatively slow inference of diffusion models in image processing. The scheme had good efficiency and accuracy.¹⁰ Hickman et al. stated that decisions made during text preprocessing affect how well language content or style is captured, the statistical power of subsequent analysis, and the validity of insights derived from text mining. They therefore reviewed research on organizational text mining and computational linguistics to provide empirically based recommendations for text preprocessing.¹¹

NLP is the bridge between machine language and human language that enables human-machine communication, and it is the foundation of MT. Khurana et al. stated that NLP has been widely applied in fields such as MT, information extraction, and question answering. They discussed the different levels of NLP and the components of natural language generation, distinguishing four stages, and also introduced the history and evolution of NLP as well as current trends and challenges.¹² Min et al. pointed out that the key idea of large-scale PTLM is to learn a universal language representation once from a generic task and then apply it to different NLP tasks. They introduced the key concepts of the large-scale PTLM architecture and gave a comprehensive account of the shift toward NLP technology driven by large-scale PTLM.¹³ Ranathunga et al. addressed the poor performance of NMT on low-resource language (LRL) pairs by surveying research progress on NMT for LRLs in detail. Through quantitative analysis, they provided technical guidelines for selecting data settings for a given LRL.¹⁴ Li et al. addressed the sensitivity of unsupervised multi-modal MT models to spurious correlations by adopting multi-modal back-translation. Their approach used spatio-temporal graphs obtained from videos to exploit object interactions in space and time for disambiguation, and used visual-centered captions as additional weak supervision. The model had good translation and generalization abilities.¹⁵ Zhang et al. proposed a frequency-aware token-level contrastive learning method for low-frequency word prediction in NMT systems, which pushes the hidden state of each decoding step away from the counterparts of other target words. The method was robust while maintaining high accuracy in predicting low-frequency words.¹⁶ Chung et al. addressed the lack of comprehensive evaluations of MT effectiveness by examining the impact of MT use on learners' second language writing with automatic computational tools and human raters. They also investigated how proficiency and text type affect learners’ use of MT, verifying that MT helps improve accuracy.¹⁷

In summary, although cross-language TextM has made positive progress, in practice it still faces problems such as poor language adaptability and the high cost of data annotation for new languages. At present, mainstream methods such as BERT and its multilingual variant mBERT rely on large-scale pre-training to capture deep semantics, which incurs high inference costs and does not model changes in syntactic structure explicitly enough. SACNN combines lightweight multi-head attention with a CNN to enhance robustness against word order differences, fusing global semantic associations with local structural features, which makes it better suited to deployment in resource-sensitive scenarios. Therefore, constructing a CL-TMM based on SACNN has practical value for balancing model performance and cross-language robustness.

CL-TMM and NMT research

The development of the Internet has accelerated the worldwide information flow and resource sharing, but the communication barriers caused by cultural differences and language differences have blocked the process of information exchange around the world. To improve the adaptability and accuracy of cross-language TextM and translation, this study builds SACNN-based CL-TMM and ITransRT.

Construction of CL-TMM based on SACNN

The TextM algorithm aims to determine the relationship between two texts by comparing their similarity and is an important component of a question answering system. Cross-language TextM refers to the task of matching and comparing text from different languages. In the field of NLP, cross-language TextM is an important issue because there are differences in vocabulary, grammar, and culture between different languages, so it is necessary to cross these differences for TextM. The study innovatively combines multi-head self-attention mechanism (SAM) and improved CNN to extract global and fragment information of sentences, and uses fully connected networks for feature fusion. Cross language text matching requires simultaneous processing of semantic correspondences at the lexical level and structural differences at the syntactic level. SACNN utilizes SAM to capture global semantic information of text, effectively understanding the overall semantics of texts in different languages. Simultaneously utilizing CNN to extract local features can handle syntactic differences and phrase structure correspondence issues. The combination of the two enables the model to understand text more comprehensively, especially for the challenge of cross linguistic text matching, which is superior to traditional architectures that only rely on local features, such as Bilinear Convolutional Neural Network. The CL-TMM based on SACNN proposed in this study mainly includes three matching modules: multi-head SAM, convolutional features matching, and fusion output. The structure of SACNN is shown in Figure 1.

Figure 1.

Structure diagram of SACNN.

Multi-head SAM is a mechanism used in deep learning to enhance a model’s ability to attend to its input. It was first introduced in attention models and has since been widely used, especially in Transformer models.¹⁸ The attention mechanism was proposed to address a problem in training recurrent neural network models: when information must be passed across many time steps, the model forgets the information transmitted in earlier time steps. The calculation process is shown in Figure 2.

Figure 2.

The calculation process of attention mechanism.

In traditional attention mechanisms, models learn to weight and combine information from different positions in the input sequence to incorporate the information carried by each position into the final representation. Multi-head SAM further expands this concept. It allows the model to simultaneously focus on different parts of the input and integrate these focused results, thereby enabling a more comprehensive understanding of the input sequence. Multi-head SAM can extract global information within a sentence, and its formal description is formula (1).

$$
\begin{cases}
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_{1}, \ldots, head_{h}) W^{O} \\
head_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})
\end{cases}
\tag{1}
$$

In formula (1), $Q$, $K$, and $V$ represent the sets of queries, keys, and values, respectively, and $W_{i}^{Q}$, $W_{i}^{K}$, and $W_{i}^{V}$ are the projection matrices of the $i$-th head. $\mathrm{Concat}(x_{1}, \ldots, x_{n})$ represents the concatenation function. $W^{O}$ represents the output parameter matrix. Residual connections are introduced during the calculation, and the output of each layer is normalized. The output of each sub-layer is given by formula (2).

$$
\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x)) \tag{2}
$$

In formula (2), $\mathrm{LayerNorm}(\cdot)$ represents the layer normalization function and $\mathrm{Sublayer}(x)$ represents the function implemented by the sub-layer. In the multi-head SAM matching module, two calculations, difference and dot multiplication, are used to measure similarity, as shown in formula (3).

$$
\begin{cases}
V_{diff} = | V_{left} - V_{right} | \\
V_{dot} = V_{left} \cdot V_{right}
\end{cases}
\tag{3}
$$

In formula (3), $V_{diff}$ represents the absolute value of the difference between the text feature vectors (TFVs) on both sides, $V_{left}$ represents the left TFV, $V_{right}$ represents the right TFV, and $V_{dot}$ represents the dot product of the TFVs on both sides. The stacking depth of the multi-head SAM module is set to 3, and the dimensions of the queries, keys, and values are set to 32. The convolutional feature module is responsible for extracting fragment features of the text. The schematic diagram of the multi-head SAM and the convolutional module is shown in Figure 3.

Figure 3.

Schematic diagram of multi-head SAM and CNN feature module.
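The multi-head SAM matching branch described by formulas (1) to (3) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The stacking depth of 3 and the query, key, and value dimension of 32 follow the text; the number of heads, the model width `d_model`, the use of scaled dot-product attention for $\mathrm{Attention}(\cdot)$, mean pooling to obtain a sentence vector, random untrained weights, and reading the "dot multiplication" as an element-wise product are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """Formula (1) with Q = K = V = x.

    x: (n, d_model); w_q, w_k, w_v: (h, d_model, d_k); w_o: (h * d_k, d_model).
    """
    heads = []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        q, k, v = x @ wq, x @ wk, x @ wv
        # Scaled dot-product attention is assumed for Attention(.).
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        heads.append(weights @ v)
    return np.concatenate(heads, axis=-1) @ w_o


def layer_norm(x, eps=1e-6):
    """Layer normalization without learned gain and bias (a simplification)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def sam_layer(x, params):
    """Formula (2): LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + multi_head_self_attention(x, *params))


def similarity_features(v_left, v_right):
    """Formula (3): absolute difference and element-wise product, concatenated."""
    return np.concatenate([np.abs(v_left - v_right), v_left * v_right])


rng = np.random.default_rng(0)
# d_k = 32 and n_layers = 3 follow the text; d_model and n_heads are assumed.
d_model, d_k, n_heads, n_layers = 64, 32, 4, 3


def init_layer():
    """Random (untrained) projection weights for one SAM layer."""
    proj = [rng.normal(0.0, 0.1, (n_heads, d_model, d_k)) for _ in range(3)]
    return (*proj, rng.normal(0.0, 0.1, (n_heads * d_k, d_model)))


layers = [init_layer() for _ in range(n_layers)]


def sentence_vector(tokens):
    """Run the stacked SAM layers and mean-pool over tokens (assumed pooling)."""
    h = tokens
    for params in layers:
        h = sam_layer(h, params)
    return h.mean(axis=0)


# Hypothetical word embeddings for a 7-token and a 9-token sentence.
left = sentence_vector(rng.normal(size=(7, d_model)))
right = sentence_vector(rng.normal(size=(9, d_model)))
features = similarity_features(left, right)
print(features.shape)
```

The two sentences may have different lengths because pooling reduces each to a fixed-size vector before formula (3) is applied; the concatenated feature vector is what would be passed on to the fusion network.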

Firstly, the text on both sides is fed into a convolutional layer. Assuming the word vectors in the sentence are $v_{1}, v_{2}, \ldots, v_{n}$, convolutional weights are used to generate feature vectors, as shown in formula (4).

$$
\begin{cases}
p_{i} = \tanh (W \cdot c_{i} + b) \\
\tanh (x) = \frac{\sinh (x)}{\cosh (x)} = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}
\end{cases}
\tag{4}
$$

In formula (4), $p_{i}$ represents the feature vector, $W$ represents the convolutional weight, $c_{i}$ represents the input window at position $i$, taken from the word vectors $v_{1}, \ldots, v_{n}$ with a width equal to the filter size, $b$ represents the bias, and $\tanh$ represents the activation function. Then, assuming that the sentences on both sides are padded to length $l$, the attention mechanism is used to expand the convolutional feature map, as shown in formula (5).

$$
\begin{cases}
A_{i, j} = \text{match-score}(F_{0, r} [:, i], F_{1, r} [:, j]) \\
\text{match-score}(x, y) = \frac{1}{1 + | x - y |} \\
F_{0, a} = W_{0} \cdot A^{T}, \quad F_{1, a} = W_{1} \cdot A
\end{cases}
\tag{5}
$$

In formula (5), $A \in \mathbb{R}^{l \times l}$ represents the attention matrix, $\text{match-score}$ represents the matching score, and $F_{i, r}$ represents the feature map of sentence $i$, where $i = 0$ denotes the left sentence and $i = 1$ denotes the right sentence. $F_{i, a}$ is the corresponding attention feature map, obtained with the weight matrix $W_{i}$. After the convolution and attention calculations are completed, the convolutional feature maps and attention feature maps are concatenated and pooled to maintain high-dimensional feature invariance, reduce parameter size, and prevent model overfitting. This study adopts average pooling, as shown in formula (6).

$$
S_{i j} = \frac{1}{c^{2}} \left( \sum_{i = 1}^{c} \sum_{j = 1}^{c} F_{i j} \right) + b_{2} \tag{6}
$$

In formula (6), $S_{i j}$ represents the sub-sampled feature map, $c$ represents the moving step size of pooling, $F_{i j}$ represents the input feature map matrix, and $b_{2}$ represents the bias value. Finally, using formula (3), the difference similarity features and dot product similarity features of the left and right high-dimensional vectors are calculated and concatenated. The concatenated results are sent to the feature fusion fully connected network. The fusion output module fuses the matching results of the multi-head SAM module and the convolutional feature module. A fully connected network with 32 hidden neurons is used for feature fusion. The Softmax function is used for activation, and the cross-entropy function is used as the loss function of TextM, as shown in formula (7).

$$
\begin{cases}
\sigma (z)_{j} = \frac{e^{z_{j}}}{\sum_{k = 1}^{K} e^{z_{k}}}, \quad j = 1, \ldots, K \\
binary\_loss = - \frac{1}{n} \sum_{i = 1}^{n} \left[ y^{i} \log y_{pre}^{i} + (1 - y^{i}) \log (1 - y_{pre}^{i}) \right]
\end{cases}
\tag{7}
$$

In formula (7), $\sigma (z)_{j}$ represents the predicted probability of the $j$-th node in the output layer, computed from its input value $z_{j}$, and $K$ is the number of output nodes. $n$ represents the total number of samples, $y^{i}$ represents the true label, and $y_{pre}^{i}$ represents the predicted label. In summary, this study introduces multi-head SAM and CNN to integrate global and fragment information of text, and builds a CL-TMM based on SACNN. The specific process of this model is shown in Figure 4.
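The output activation and loss in formula (7) can be checked with a small standard-library sketch. The logits and labels below are hypothetical values chosen for illustration, and the clipping constant `eps` is an added numerical safeguard that does not appear in the formula.

```python
import math


def softmax(z):
    """Softmax over a list of output-layer values z_1..z_K."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over n samples, as in formula (7)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite (assumed safeguard)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical two-class logits (match, no match) for three sentence pairs.
logits = [[2.0, 0.5], [0.1, 1.2], [1.5, 1.4]]
labels = [1, 0, 1]
match_probs = [softmax(z)[0] for z in logits]
loss = binary_cross_entropy(labels, match_probs)
print(match_probs, loss)
```

With two output nodes the Softmax probability of the "match" class plays the role of $y_{pre}^{i}$, so the loss goes to zero as the predicted probabilities approach the true labels.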

Figure 4.

Flowchart of a CL-TMM based on SACNN.
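For the convolutional feature module, the attention matrix of formula (5) and the average pooling of formula (6) can be sketched as follows. This is an illustrative reading, not the authors' code: $| x - y |$ is taken to be the Euclidean distance between feature-map columns, the feature maps and weight matrices are random placeholders, the convolutional and attention feature maps are concatenated along the feature axis, and pooling uses non-overlapping $c \times c$ windows.

```python
import numpy as np


def attention_matrix(f0, f1):
    """Formula (5): A[i, j] = 1 / (1 + ||f0[:, i] - f1[:, j]||).

    f0, f1: feature maps of shape (d, l) for the left and right sentence.
    """
    diff = f0[:, :, None] - f1[:, None, :]  # (d, l, l) pairwise column differences
    dist = np.linalg.norm(diff, axis=0)     # Euclidean distance (assumed)
    return 1.0 / (1.0 + dist)


def attention_feature_maps(a, w0, w1):
    """Formula (5): F_{0,a} = W_0 A^T and F_{1,a} = W_1 A."""
    return w0 @ a.T, w1 @ a


def average_pool(f, c, bias=0.0):
    """Formula (6): mean over non-overlapping c x c windows plus a bias."""
    rows, cols = f.shape
    rows, cols = rows - rows % c, cols - cols % c  # drop any ragged border
    blocks = f[:rows, :cols].reshape(rows // c, c, cols // c, c)
    return blocks.mean(axis=(1, 3)) + bias


rng = np.random.default_rng(0)
d, l = 4, 6  # hypothetical feature dimension and padded sentence length
f0_r, f1_r = rng.normal(size=(d, l)), rng.normal(size=(d, l))
a = attention_matrix(f0_r, f1_r)
w0, w1 = rng.normal(size=(d, l)), rng.normal(size=(d, l))
f0_a, f1_a = attention_feature_maps(a, w0, w1)

# Concatenate convolutional and attention feature maps, then pool.
stacked_left = np.concatenate([f0_r, f0_a], axis=0)
pooled_left = average_pool(stacked_left, c=2)
print(a.shape, pooled_left.shape)
```

Because the match score is $1 / (1 + \text{distance})$, every entry of $A$ lies in $(0, 1]$ and equals 1 only when the two columns are identical, which is what lets $A$ act as a soft alignment between the two sentences.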

Construction of the ITransRT model

After implementing cross language TextM, further MT can be carried out. Real-time MT refers to the real-time conversion of text or speech from one language to another through a computer system to achieve real-time cross-language communication. It is one of the most difficult MT tasks to handle. The limited resources provided by traditional MT inputs can to some extent affect the translation results. The Transformer model is a deep learning model based on attention mechanism, widely used in NLP tasks, and has achieved significant breakthroughs in handling long-distance dependencies and modeling sequence information.^19,20 Therefore, based on the Transformer model, this study introduces images corresponding to the source language text description as auxiliary modalities to enrich translation resources and compensate for semantic deficiencies caused by insufficient input information at the source end. Specifically, for the input source language text, this study utilizes a pre trained visual model to extract the global visual feature vector of its corresponding image. Visual features corresponding to the text are added to the decoder, and visual feature encoding, positional encoding, and lexical encoding are combined to assist translation to improve translation quality. The introduction of visual features provides additional semantic information for translation models, especially when processing multimodal text, which can significantly improve the accuracy and fluency of translation. The structural framework of the Transformer model is shown in Figure 5.

Figure 5.

The structural framework of the Transformer model.

In traditional NMT tasks, the encoder takes the source sentence sequence $x = (x_{1}, x_{2}, \ldots, x_{n})$ as input and generates the hidden states $H = f (x) = (H_{1}, H_{2}, \ldots, H_{n})$. The decoder generates one word at each time step, predicting the next output $y_{t}$ from the hidden states of the source sentence and the previously generated words, and finally obtains the probability of the target sequence $y = \{y_{1}, y_{2}, \ldots, y_{m}\}$, as shown in formula (8).

$$
P (y | x) = \prod_{t = 1}^{| y |} P (y_{t} | x, y_{< t}) \tag{8}
$$

Traditional NMT may lead to premature generation of incomplete or incorrect translations, hence the Wait-k strategy has emerged. The Wait-k strategy is a strategy used to improve the NMT decoding process by delaying generation and waiting for more information to be transmitted to improve translation accuracy and fluency. Specifically, the Wait-k strategy requires the decoder to generate only one special “wait mark” as output at each time step, rather than directly generating vocabulary. Once $k$ consecutive waiting tags are generated, the decoder will be triggered to generate the actual vocabulary. The decoding probability of the Wait-k strategy is formula (9).

$$
\begin{cases}
P_{wait\text{-}k} (y | x) = \prod_{t = 1}^{| y |} P (y_{t} | x_{\leq g (t)}, y_{< t}) \\
g (t) = \min \{ k + t - 1, n \}
\end{cases}
\tag{9}
$$

In formula (9), $g (t)$ represents the function of the Wait-k strategy, which is the number of source words that the model can refer to when generating the target word $y_{t}$ at time $t$. When $k + t - 1$ exceeds the length $n$ of the source statement, $g (t)$ is fixed to $n$. In addition, this study adds a new hierarchical attention scheme in the decoding section to help the model fuse image and text information. Firstly, the feature vectors of the two modalities are obtained through the text encoder and the visual feature extraction model, and the corresponding context vectors are obtained through the first attention layer, as shown in formula (10).

$$
\begin{cases}
e_{i, j}^{f} = F^{f} (s_{i}, h_{j}^{f}) \\
\alpha_{i, j}^{f} = \frac{\exp (e_{i, j}^{f})}{\sum_{l = 1}^{| h^{f} |} \exp (e_{i, l}^{f})} \\
c_{i}^{f} = \sum_{j = 1}^{| h^{f} |} \alpha_{i, j}^{f} h_{j}^{f}
\end{cases}
\tag{10}
$$

In formula (10), $F^{f}$ represents the feed-forward neural network corresponding to each modality $f$, $s_{i}$ represents the hidden state at the $i$-th decoding step, $h_{j}^{f}$ represents the $j$-th information vector of modality $f$, and $c_{i}^{f}$ represents the context vector. Then, the context vectors of the text and image are mapped to the same space, and the second attention layer is used to calculate another distribution and the corresponding weighted average of these mapped vectors, as shown in formula (11).

$$
\begin{cases}
\tilde{e}_{i}^{f} = F (s_{i}, c_{i}^{f}) \\
\beta_{i}^{f} = \frac{\exp (\tilde{e}_{i}^{f})}{\sum_{r \in \{ img, txt \}} \exp (\tilde{e}_{i}^{r})} \\
\tilde{c}_{i} = \sum_{r \in \{ img, txt \}} \beta_{i}^{r} W^{r} c_{i}^{r}
\end{cases}
\tag{11}
$$

In formula (11), $\beta_{i}^{f}$ represents the fusion weight of modality $f$ when text and image features are fused, $\tilde{c}_{i}$ represents the context vector obtained after feature fusion, and $W^{r}$ represents the weight matrix corresponding to modality $r$. Due to the use of the Wait-k strategy, the text context vector in this study is calculated by formula (12).

$$
\hat{c}_{i}^{txt} = \sum_{j = 1}^{g (t)} \alpha_{i, j}^{txt} h_{j}^{txt} \tag{12}
$$

In formula (12), $\hat{c}_{i}^{txt}$ is calculated from the input text, and the portion of the input text that is used is determined by $g (t)$, where $t$ is the decoding step indexed by $i$ in formula (10). Finally, the decoding probability is obtained by combining text information and visual features, as shown in formula (13).

$$
P (Y | X, Z) = \prod_{t = 1}^{| Y |} P (y_{t} | X_{\leq g (t)}, Z, y_{< t}) \tag{13}
$$
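The role of $g (t)$ in formulas (9), (12), and (13) can be made concrete with a short sketch that lists the source prefix available at each decoding step. The token list and the helper name `wait_k_prefixes` are hypothetical and serve only as an illustration.

```python
def g(t, k, n):
    """Formula (9): number of source tokens visible when emitting target token t.

    t is 1-indexed, k is the Wait-k lag, and n is the source length.
    """
    return min(k + t - 1, n)


def wait_k_prefixes(source, target_len, k):
    """Source prefix that conditions each of the target_len decoding steps."""
    n = len(source)
    return [source[:g(t, k, n)] for t in range(1, target_len + 1)]


source = ["a", "dog", "runs", "in", "the", "park"]  # hypothetical source sentence
for t, prefix in enumerate(wait_k_prefixes(source, target_len=6, k=2), start=1):
    print(t, prefix)
```

With $k = 2$ the first target word is produced after two source words have been read, each later step reveals one more source word, and once the whole source has been read the prefix stays fixed at length $n$.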

In summary, this study builds a real-time MT model based on the Transformer model, combined with the Wait-k strategy and hierarchical attention scheme. The specific structure is displayed in Figure 6.

Figure 6.

Real-time MT model structure diagram.
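The two-level attention of formulas (10) to (12) can be sketched in NumPy as below. This is a hedged illustration rather than the model's actual code: the scoring networks $F^{f}$ and $F$ are replaced by simple bilinear forms, all weights and feature vectors are random placeholders, the dimensions are arbitrary, and the image is represented by a single global feature vector, as described in the text.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(s_i, h, w):
    """First attention layer, formula (10), for one modality.

    s_i: (d_s,) decoder state; h: (m, d_h) information vectors; w: (d_h, d_s).
    The bilinear score h_j^T W s_i stands in for the network F^f (assumption).
    """
    alpha = softmax(h @ (w @ s_i))
    return alpha @ h, alpha


def fuse(s_i, contexts, score_w, proj_w):
    """Second attention layer, formula (11), over the modality contexts."""
    names = list(contexts)
    e = np.array([s_i @ score_w[r] @ contexts[r] for r in names])
    beta = softmax(e)
    fused = sum(b * (proj_w[r] @ contexts[r]) for b, r in zip(beta, names))
    return fused, dict(zip(names, beta))


rng = np.random.default_rng(0)
d_s, d_txt, d_img, d_out = 8, 8, 16, 8   # hypothetical dimensions
s_i = rng.normal(size=d_s)
h_txt = rng.normal(size=(5, d_txt))      # encoder states of 5 source tokens
h_img = rng.normal(size=(1, d_img))      # one global visual feature vector

k, t = 2, 1
visible = min(k + t - 1, len(h_txt))     # g(t): Wait-k limit used in formula (12)
c_txt, alpha_txt = attend(s_i, h_txt[:visible], rng.normal(size=(d_txt, d_s)))
c_img, alpha_img = attend(s_i, h_img, rng.normal(size=(d_img, d_s)))

score_w = {"txt": rng.normal(size=(d_s, d_txt)), "img": rng.normal(size=(d_s, d_img))}
proj_w = {"txt": rng.normal(size=(d_out, d_txt)), "img": rng.normal(size=(d_out, d_img))}
fused, beta = fuse(s_i, {"txt": c_txt, "img": c_img}, score_w, proj_w)
print(fused.shape, beta)
```

The projection matrices $W^{r}$ are what allow context vectors of different sizes to be averaged in a common space, and the weights $\beta$ let the decoder lean more on the image when only a short source prefix is visible.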

The improvements to the standard Transformer architecture in this study are mainly reflected in three aspects. Firstly, visual features related to the text are introduced as an auxiliary modality and integrated with text features to enrich translation resources and compensate for semantic deficiencies. Secondly, a hierarchical attention scheme encodes the text and image features separately and dynamically adjusts the feature weights through a multi-layer attention mechanism to achieve more effective information fusion. Finally, the Wait-k strategy delays the generation process of the decoder until more contextual information has been passed in, thereby improving the accuracy and fluency of the translation.

Analysis of the effect of TextM and real-time MT

This study constructs CL-TMM and ITransRT based on SACNN, but their practical application effects still need further verification. This study mainly analyzes from two aspects. Firstly, the effectiveness of CL-TMM is analyzed, and then the feasibility of the real-time MT model is verified.

Analysis of the effect of CL-TMM

To verify the effectiveness of the TextM model based on SACNN, this study used the English-Spanish and Spanish-English matching question datasets provided in the CIKM AnalytiCup 2018 competition. The CIKM AnalytiCup 2018 dataset contains 20,000 English question pairs and 1,400 Spanish question pairs, all manually annotated and matched by language experts. Using Google Translate’s API, the dataset was translated into French and German versions. Samples longer than 50 words after translation were removed, as were samples containing special symbols or invalid characters. However, relying on an API to translate datasets may lead the model to learn specific translation styles or biases, thereby affecting its generalization ability. Therefore, the translated data were manually validated to identify and correct possible errors or unnatural translations. Four TextM models were independently trained for English, Spanish, French, and German. For each language, 80% of the data was used for training and 20% for testing the SACNN TextM model. It was compared with three models: bidirectional long short-term memory network (Bi-LSTM), bilinear convolutional neural network (BCNN), and attention-based convolutional neural network (ABCNN). The accuracy of each model on the English TextM dataset is shown in Figure 7. The proposed model has the highest accuracy of the four, reaching 83.42% at epoch 4, followed by the ABCNN model, while the Bi-LSTM model has the lowest accuracy. In addition, as the number of iterations increases, the accuracy of the proposed model fluctuates less, indicating good stability. The results indicate that the SACNN model has good accuracy and stability.

Figure 7.

Accuracy of various models on English TextM datasets.

The TextM accuracy results of the above four models in four languages are shown in Figure 8. In Figure 8(a), SACNN has the highest matching accuracy in the English dataset, at 83.22%. In Figure 8(b), the matching accuracy of SACNN in French is higher than the other three models, at 84.47%. In Figure 8(c), the matching accuracy of SACNN in German is higher than the other three models, at 78.63%. In Figure 8(d), SACNN has a higher matching accuracy in Spanish, but slightly lower than the ABCNN model at 80.21%, which is 0.15% lower than the ABCNN model. The results show that the proposed SACNN TextM model has good overall language matching performance and high accuracy.

Figure 8.

TextM accuracy results in four languages.

To further validate the effectiveness of the SACNN-based CL-TMM, this study conducted cross-language TextM experiments using the aforementioned dataset, and the results are shown in Figure 9. In Figure 9(a), in the experiment of matching French, German, and Spanish with English, the accuracy of the proposed model remained the highest, at 78.96%, 77.55%, and 79.86%, respectively. In Figure 9(b), in the experiment of matching English with French, German, and Spanish, SACNN still had the highest accuracy, at 79.16%, 75.03%, and 76.54%, respectively. The results showed that the CL-TMM based on SACNN adapted well across language pairs and matched more accurately than the comparison models, demonstrating its feasibility and superiority.

Figure 9.

Results of CL-TMM experiment.

To verify the efficiency of the TextM model, the training time, inference time, and model size of the four models mentioned above were compared, and the results are shown in Table 1. Among the four models, the proposed model is the largest, at 5.2 M. Its training time and inference time are also the longest, at 3162 s and 150 s, respectively, but still within an acceptable range.

Table 1.

The efficiency and model size comparison results of each model.

Text matching model	Training time/s	Inference time/s	Model size/M
BCNN	2465	120	4.7
ABCNN	3033	135	4.8
Bi-LSTM	1983	100	3.5
SACNN	3162	150	5.2

The precision, recall, and F1 score of the four models were further compared, and a paired t-test was used to evaluate whether the performance difference between the proposed SACNN model and each of the other models is statistically significant. The significance level α was set to 0.05; a difference with p < 0.05 is considered statistically significant. Five independent experiments were conducted for each model, the indicator values of each experiment were recorded, and the mean and standard deviation were calculated to evaluate the stability and significance of model performance. The statistical significance test results are shown in Table 2. The average precision, recall, and F1 score of the proposed SACNN model are 88.46%, 86.32%, and 87.54%, respectively, which are significantly higher than those of the baseline models (p < 0.05).

Table 2.

Statistical significance test results.

Text matching model	Precision/%	Recall/%	F1/%	p (vs. SACNN)
BCNN	76.42 ± 4.01	74.62 ± 5.10	75.07 ± 4.02	<0.05
ABCNN	81.29 ± 3.15	79.85 ± 3.84	80.49 ± 3.63	<0.05
Bi-LSTM	78.46 ± 3.53	75.44 ± 4.07	76.61 ± 3.42	<0.05
SACNN	88.46 ± 2.34	86.32 ± 2.73	87.54 ± 1.86	–
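The paired t-test used for Table 2 can be sketched with the Python standard library. The per-run F1 scores below are hypothetical placeholders, not values from the experiments, because only means and standard deviations are reported. With five paired runs there are 4 degrees of freedom, for which the two-sided critical value of Student's t at α = 0.05 is approximately 2.776.

```python
import math
import statistics

# Two-sided critical value of Student's t for alpha = 0.05 and df = 4.
T_CRIT_DF4 = 2.776


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Hypothetical F1 scores (%) from five paired runs, for illustration only.
sacnn_f1 = [87.1, 89.0, 86.2, 88.4, 87.0]
abcnn_f1 = [80.3, 83.5, 79.0, 82.1, 77.6]

t_stat = paired_t_statistic(sacnn_f1, abcnn_f1)
significant = abs(t_stat) > T_CRIT_DF4
print(round(t_stat, 2), significant)
```

Pairing the runs (for example, by shared random seed or data split) removes run-to-run variation that affects both models equally, which is why the test is applied to the per-run differences rather than to the two sets of scores separately.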

To investigate the individual contributions of the different components of the proposed SACNN-based cross-language text matching model, this study conducted ablation experiments, using cosine similarity, semantic textual similarity (STS), and BERTScore as evaluation metrics for semantic similarity. The complete model was compared with the model without multi-head SAM, the model without CNN, and the model without the fusion output module. The results of the ablation experiment are shown in Table 3. The complete model performs best on the cosine similarity, STS, and BERTScore metrics, with values of 0.85, 0.85, and 0.79, respectively, indicating that each module contributes to the text matching performance. Removing the multi-head SAM causes the largest drop, to 0.75, 0.73, and 0.70.

Table 3.

Results of ablation experiment.

Model	Cosine similarity	STS	BERT score
Without multi-head SAM	0.75	0.73	0.70
Without CNN	0.80	0.79	0.75
Without fusion output module	0.78	0.78	0.73
Complete model	0.85	0.85	0.79

To further validate the superiority of the proposed TextM model, this study evaluates its accuracy in English and French, and compares it with the Bidirectional Encoder Representation from Transformers model combined with the SimCSE framework (SimCSE-BERT), the Text Semantic Matching Model based on Knowledge Enhancements (TSM-KE), and the BERT model combined with the Linguistic Knowledge Enhanced Graph Transformer (LET-BERT). These three comparative models are currently three advanced TextM models. The result is shown in Figure 10. From Figure 10(a) and (b), in the TextM tasks of English and French, the matching accuracy of the research model is the highest, with 83.46% and 85.86%, respectively. The results indicate that the CL-TMM based on SACNN has high matching accuracy, and has certain feasibility and superiority.

Figure 10.

Comparison results of matching accuracy.

Feasibility analysis of real-time MT model

To verify the performance of ITransRT, this study uses bilingual evaluation understudy (BLEU) as the evaluation metric and tests the model on the Multi30K dataset. The Multi30K dataset contains approximately 30,000 images and their corresponding textual descriptions in English, German, French, and other languages, with over 100,000 sentences. The images are mainly from the Flickr30k and MSCOCO datasets and cover daily life scenes. The images are standardized, resized, and pixel-normalized. The text descriptions are preprocessed with word segmentation and stop word removal and then fused with the corresponding image features. The comparison with the traditional Wait-k strategy is shown in Figure 11. In Figure 11(a), for the French-English translation task, the BLEU scores of both models increase as K increases. In Figure 11(b), for the English-French translation task, the BLEU score of the research model is higher; as K increases, the gap between the two models gradually narrows and the scores tend to converge. In Figure 11(c), the research model achieves higher BLEU scores in the German-English translation task. In Figure 11(d), there is no significant difference between the two models in the English-German translation task when K = 3. In summary, ITransRT achieves higher BLEU scores in most cases, indicating that it can improve the quality of real-time MT and is feasible and effective.
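
As a rough illustration of the metric, the following is a minimal sketch of unsmoothed, single-reference, sentence-level BLEU: the geometric mean of clipped n-gram precisions up to 4-grams, multiplied by a brevity penalty. It is not the evaluation script used in the study, which is not specified, and the function names are illustrative. The function returns a value in [0, 1], whereas the scores reported in this section are on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU in [0, 1] (illustrative sketch)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages candidates shorter than the reference.
    c_len, r_len = len(candidate), len(reference)
    brevity = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(sentence_bleu("the cat sat on the mat".split(), reference))
print(sentence_bleu("the cat sat on".split(), reference))
```

Corpus-level BLEU, which is what translation papers normally report, aggregates the n-gram counts over all sentences before taking the precisions rather than averaging sentence scores.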

Figure 11.

BLEU score results for different translation tasks.
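
The Wait-k baseline compared in Figure 11 follows a fixed read/write schedule: the decoder first reads K source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted. The sketch below shows only this schedule; the function names are illustrative, and it does not model the visual features or attention layers of ITransRT.

```python
def wait_k_visible_source(src_len, tgt_len, k):
    """Number of source tokens visible when each target token is written."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Target position t (0-indexed) may attend to the first k + t source tokens.
    return [min(k + t, src_len) for t in range(tgt_len)]


def wait_k_actions(src_len, tgt_len, k):
    """Interleaved READ/WRITE actions implied by the Wait-k schedule."""
    actions = []
    read = 0
    for visible in wait_k_visible_source(src_len, tgt_len, k):
        while read < visible:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions


# Example with a 5-token source, a 5-token target, and K = 2.
print(wait_k_visible_source(5, 5, 2))
print(" ".join(wait_k_actions(5, 5, 2)))
```

A larger K lets the decoder see more source context before each output token, which is consistent with the BLEU scores rising with K in Figure 11, at the cost of higher latency.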

To verify the adaptability of the translation model to Chinese-English translation, this study conducts experiments on the UNv1.0 bilingual parallel corpus. Parallel corpora of 1 M, 2 M, and 3 M sentences are selected as training sets, with 5k sentences used as the test set and 8k sentences as the validation set. Python is used as the development language, and Jieba and NLTK are used for Chinese and English word segmentation, respectively. The vocabulary size for both Chinese and English is set to 30k. The comparison with the traditional Transformer and LSTM translation models is shown in Figure 12. As the training set grows, the BLEU scores of all three translation models increase. The research model achieves the highest BLEU score at every training set size, with values of 20.19, 26.65, and 27.55 for the 1 M, 2 M, and 3 M training sets, respectively. Although performance improves with training set size, the BLEU score remains low at the smallest training set size, which points to a limitation of the model when data is scarce. Future research can focus on optimizing the model architecture, increasing the diversity of training data, and improving feature fusion strategies to further enhance translation quality.

Figure 12.

The BLEU score results of three models for Chinese-English translation.
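
The scalability figure quoted in the abstract (a 36.45% BLEU improvement from 1 M to 3 M training sentences) can be reproduced from the BLEU values reported for the research model with a short calculation. The helper name `relative_gain` is illustrative, and the intermediate 1 M to 2 M and 2 M to 3 M gains are derived here for context rather than stated in the study.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# BLEU of the research model at each training set size, as reported in the text.
bleu_by_size = {"1M": 20.19, "2M": 26.65, "3M": 27.55}

gain_1m_to_3m = relative_gain(bleu_by_size["1M"], bleu_by_size["3M"])
gain_1m_to_2m = relative_gain(bleu_by_size["1M"], bleu_by_size["2M"])
gain_2m_to_3m = relative_gain(bleu_by_size["2M"], bleu_by_size["3M"])

print(f"1M -> 3M: {gain_1m_to_3m:.2f}%")  # about 36.45%
print(f"1M -> 2M: {gain_1m_to_2m:.2f}%")
print(f"2M -> 3M: {gain_2m_to_3m:.2f}%")
```

Most of the gain comes from the first doubling of the data, so the improvement per additional million sentences shrinks noticeably between 2 M and 3 M.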

To further validate the effectiveness of the MT model, this study supplies the decoder with different proportions of correct translations and then evaluates the output. The test results under different training set sizes are shown in Table 4. As the proportion of correct translations in the input increases, the BLEU score of the model also increases. When the proportion is 40%, the BLEU scores for all three training set sizes are the highest, at 52.79, 55.93, and 57.17, respectively. The scores also rise with training set size at every proportion, which confirms that the amount of training data is crucial to the effectiveness of the model.

Table 4.

Translation results with different proportions of correct translations.

	Proportion of correct translations input (%)
Training set size	0	10	20	30	40
1M	20.17	22.33	30.73	39.15	52.79
2M	24.66	26.18	34.26	45.64	55.93
3M	27.57	30.27	38.55	46.89	57.17

Conclusion

The advancement of technology has driven the rapid development of NLP. To address the problem of cross-language TextM and translation, this study combined multi-head SAM and CNN to build a CL-TMM, and added visual features to the Transformer model to build a real-time MT model based on an improved Transformer. The experiments showed that the accuracy of the SACNN model was the highest, reaching 83.42% when the epoch was 4. As the number of iterations increased, the accuracy fluctuated less, showing good stability. The accuracy of the SACNN model in English, French, and German matching was higher than that of the other three models, at 83.22%, 84.47%, and 78.63%, respectively. The accuracy of the SACNN-based CL-TMM was the highest in the experiment of matching French, German, and Spanish with English, at 78.96%, 77.55%, and 79.86%, respectively. Its accuracy in matching English with French, German, and Spanish was also the highest, at 79.16%, 75.03%, and 76.54%, respectively. ITransRT achieved the highest BLEU scores, with values of 20.19, 26.65, and 27.55 for the 1 M, 2 M, and 3 M training sets, respectively. When the proportion of correct translations in the input was 40%, the BLEU scores for all three training set sizes were the highest, at 52.79, 55.93, and 57.17, respectively. In summary, the constructed models are feasible and effective.

The proposed SACNN model and improved Transformer model are expected to generalize well to other language pairs, domains, and text types. Without extensive retraining, adaptability to new languages or domains can be enhanced through multilingual pre-training and general-purpose feature extraction. The SACNN model, through multi-head SAM and CNN, can handle lexical and grammatical differences between language pairs and adapt to both formal and informal texts, which makes it applicable to multilingual social media monitoring and multi-domain literature retrieval. Meanwhile, the improved Transformer model, by incorporating visual features and hierarchical attention schemes together with the Wait-k strategy, can improve translation quality and adaptability, making it suitable for multilingual real-time communication and multi-domain document translation.

However, while the SACNN model is more accurate than the baseline models, its model size and training time are also higher than those of some baselines, which poses challenges for deployment in resource-constrained environments. In addition, the performance of the SACNN model on different language pairs is influenced by factors such as language complexity, data quality and quantity, cultural differences, and language characteristics. Therefore, future research should explore model compression techniques such as knowledge distillation and model quantization, as well as lightweight architecture design, to reduce computational costs while preserving model accuracy. Model performance could be improved further by increasing the scale and diversity of training data, optimizing the model architecture, and introducing more linguistic knowledge.

Footnotes

ORCID iD

Haoyi Zhang

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Maruf, Saleh, Haffari. A survey on document-level neural machine translation: methods and evaluation. ACM Comput Surv 2021; 54(2): 1–36.

2. Haddow, Bawden, Barone AVM, et al. Survey of low-resource machine translation. Comput Linguist 2022; 48(3): 673–732.

3. Cheng, Zhu, Qian, et al. Cross-modal graph matching network for image-text retrieval. ACM Trans Multimedia Comput Commun Appl 2022; 18(4): 1–23.

4. Chen, Zhang, et al. Tipcb: a simple but effective part-based convolutional baseline for text-based person search. Neurocomputing 2022; 494(4): 171–181.

5. Wan, Yang, Wong, et al. Challenges of neural machine translation for short texts. Comput Linguist 2022; 48(2): 321–342.

6. Zhang, et al. Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans Pattern Anal Mach Intell 2022; 45(1): 641–656.

7. Rossi, Oliva, Langmead, et al. MONI: a pangenomic index for finding maximal exact matches. J Comput Biol 2022; 29(2): 169–187.

8. Iqbal, Qureshi. The survey: text generation models in deep learning. Journal of King Saud University-Computer and Information Sciences 2022; 34(6): 2515–2528.

9. Zhang, Song, et al. A survey of controllable text generation using transformer-based pre-trained language models. ACM Comput Surv 2023; 56(3): 1–37.

10. Avrahami, Fried, Lischinski. Blended latent diffusion. ACM Trans Graph 2023; 42(4): 1–11.

11. Hickman, Thapa, Tay, et al. Text preprocessing for text mining in organizational research: review and recommendations. Organ Res Methods 2022; 25(1): 114–146.

12. Khurana, Koli, Khatter, et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 2023; 82(3): 3713–3744.

13. Min, Ross, Sulem, et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv 2023; 56(2): 1–40.

14. Ranathunga, Lee ESA, Prifti Skenduli, et al. Neural machine translation for low-resource languages: a survey. ACM Comput Surv 2023; 55(11): 1–37.

15. Huang, Chang, et al. Video pivoting unsupervised multi-modal machine translation. IEEE Trans Pattern Anal Mach Intell 2022; 45(3): 3918–3932.

16. Zhang, Yang, et al. Frequency-aware contrastive learning for neural machine translation. Proc AAAI Conf Artif Intell 2022; 36(10): 11712–11720.

17. Chung, Ahn. The effect of using machine translation on linguistic features in L2 writing across proficiency levels and text genres. Comput Assist Lang Learn 2022; 35(9): 2239–2264.

18. Pal, Roy, Shivakumara, et al. Adapting a swin transformer for license plate number and text detection in drone images. Artificial Intelligence and Applications 2023; 1(3): 145–154.

19. Bagal, Aggarwal, Vinod, et al. MolGPT: molecular generation using a transformer-decoder model. J Chem Inf Model 2021; 62(9): 2064–2076.

20. Nassiri, Akhloufi. Transformer models used for text-based question answering systems. Appl Intell 2023; 53(9): 10602–10635.