Intelligent English translation system based on evolutionary multi-objective optimization algorithm

Abstract

The difficulty of obtaining corpus features for neural machine translation hinders its development. To improve intelligent English translation, this paper improves a multi-objective optimization algorithm on the basis of machine learning and uses it to construct a model of an intelligent English translation system. The model is trained on both a parallel corpus and a monolingual corpus, and a semi-supervised neural machine translation method is used to analyze the data processing path in detail, with a focus on node distribution and data processing flow. In addition, this paper derives a data-dependent regularization term from the probabilistic nature of the neural machine translation model and applies it to the monolingual corpus to aid the training of the model. Finally, experiments are designed to verify the performance of the model. The results indicate that the translation model constructed in this paper performs well and can meet practical translation needs.

Keywords

Improved algorithm; machine learning; multi-objective optimization algorithm; English translation

1 Introduction

With the development of information technology, machine translation systems built on computer-based integrated information processing have significantly improved the intelligence and precision of English translation. Automatic translation systems are the main software carrier for machine translation, so the design of an automatic English translation system has considerable practical significance. Such a system analyzes English vocabulary features in detail through semantic analysis and combines semantic fuzzy matching with automatic phrase analysis to translate sentences and vocabulary automatically at large scale, using phrase-level translation to ensure accuracy and reliability.

Machine translation studies how to use computers to realize the automatic conversion from one natural language (source language) to another natural language (target language) to maintain the same semantics [1]. Early machine translation systems advocated the use of artificial rules and relied on the rules of conversion between languages discovered and summarized by human experts. The main challenge of this approach is that it has extremely high requirements for human experts, and a complete set of rules is usually not available to us. With the development of cloud computing and big data in recent years, corpus-based machine translation methods have gradually become the mainstream. This type of method advocates the use of a large-scale corpus to automatically learn mathematical models that can convert between natural languages and overcomes the bottleneck of rule-based methods. Among them, the most representative of this type of machine translation method is the statistical machine translation method [2].

In recent years, deep learning technology has developed rapidly. Due to the characteristics of deep neural networks that can automatically extract features, a large number of studies using deep learning technology to improve machine translation have begun to appear. The machine translation system that originally used deep learning technology still uses the framework of statistical machine translation, and uses deep learning technology to improve its components, such as improving the ordering model and word alignment model. Another type of machine translation system using deep learning technology no longer uses the framework of statistical machine translation but uses a neural network to directly map source language sentences to target language sentences. This approach greatly simplifies the model architecture and training process of the machine translation system, which is called an end-to-end neural machine translation system. The end-to-end neural machine translation system is a sequence-to-sequence model, usually using the encoder-decoder structural framework [3].

2 Related work

Machine translation refers to the process of using a computer to translate a source language sentence into a semantically equivalent target language sentence, and it is an important research direction in natural language processing [4]. Reference [5] proposed the idea of using machines for translation, which triggered a research boom in this direction. Machine translation methods fall into three main categories: rule-based, statistical, and neural [6]. Rule-based methods were the early mainstream of machine translation research. They translate sentences with standardized grammatical structure well, but compiling the rules is complicated and non-standard language phenomena are difficult to handle [7]. Reference [8] proposed a statistical machine translation model based on the noisy channel model. As machine learning methods such as deep learning matured, they began to be applied in natural language processing. Reference [9] proposed using neural networks for machine translation, and reference [10] proposed a neural machine translation model based on the encoder-decoder structure, which marked the entry of machine translation into the era of deep learning. Reference [11] compared neural and statistical machine translation on more than 30 language pairs. Neural machine translation surpassed phrase-based statistical machine translation in 27 tasks, which demonstrates its capability. Although neural machine translation outperforms statistical machine translation, it still has considerable room for development. Reference [12] proposed a neural machine translation model that alleviates the vanishing gradient problem of deep models by introducing residual connections between layers, which allowed the model to be stacked to 8 layers and raised machine translation quality to a new level. Reference [13] proposed an encoder-decoder model based on convolutional neural networks, which surpassed Google's model in accuracy and greatly improved translation speed. Reference [14] proposed the Transformer model based on the attention mechanism, which greatly improved both training speed and translation quality. Reference [15] combined a variety of algorithms, introduced manual evaluation into translation evaluation, and announced for the first time that a model had reached human level in news translation. Since 2019, many neural machine translation architectures have been proposed that show higher translation quality than the baseline Transformer model in experiments. In addition, techniques such as back-translation [16], data screening [17], and pre-training [18] also improve translation quality significantly.

Although neural machine translation has achieved high translation quality on standard data sets, many problems remain in practical applications. Neural machine translation models behave inconsistently during training and testing. This problem is called "exposure bias" and has attracted wide attention from researchers. In simultaneous interpretation, the model must output the translation while the input sentence is still incomplete in order to reduce translation delay, so that the user receives high-quality results with low latency. In addition to text, data in other modalities such as images and videos can sometimes be used by translation models, and the translation system can incorporate this information to further improve translation quality. To improve translation speed, non-autoregressive models predict the target words independently of one another, so that the entire sentence can be decoded in parallel. However, this approach also causes serious omissions and over-translation. When translating document-level text, the model needs to consider context information to keep the translation consistent. When in-domain data are scarce and out-of-domain data are plentiful, the out-of-domain data must be used sensibly to improve in-domain translation quality. When translation between multiple languages is needed, training a multilingual translation model can greatly reduce the number of translation models required while improving the translation quality of low-resource languages [19].

Because of the advantages of multi-modal models, multi-modal machine translation that fuses text with visual information from images has attracted the attention of researchers in recent years. In the task of image description generation, following the end-to-end framework of machine translation, reference [20] took the image vector output by a pre-trained convolutional neural network as input and sent it to the decoder as the initial hidden state vector. In this way, the decoder can make fuller use of the semantic information in the image when generating the description sentence, which improves the image description generation task. Under the end-to-end encoder-decoder machine translation framework, reference [21] integrated the visual information of images into both the encoder and the decoder so that images contribute more to text translation. In another work, two independent attention mechanisms are used to process the words of the source language and the regions of the image separately to improve the translation result of the model.

3 Problem definition

Taking multi-objective optimization as an example, the model in this paper considers two important factors when selecting the optimization route: the first is the redundancy risk of the route, and the second is giving priority to areas with higher weights [22].

Definition 1: Redundancy risk: Data redundancy security is affected by many factors. We set the redundancy risk to 0 before data analysis and to 1 when a problem occurs in the data analysis after the English translation data have been processed. In general, the risk index is a constant between 0 and 1. The larger the constant, the greater the risk of data redundancy in the translation process.

Definition 2: Time spent: the time spent on data processing is proportional to the length of the data [23], $w_{0}^{c} = \frac{l}{v}$ (1)

Here l is the data length and v is the data processing speed. When data processing is interrupted, the time spent is the sum of the normal data processing time and the additional processing time: $w^{c} = w_{0}^{c} + T$ (2)

T is the additional processing time, and the time index needs to be normalized so that it is dimensionless.

In the process of data translation, the road network model is formed by the data recognition system, that is, the undirected weighted graph [24]. $G = (V, E, W)$ (3)

Among them, the data input points, output points and data processing centers constitute the node set V = {v₁, v₂, ⋯, v_n}, the set of data processing points to be translated constitutes Z ⊆ V, and the manual data input point and manual replenishment point are denoted v_s, v_t ∈ V. The upper limit of the amount of data processing is M. The road network constitutes the edge set E = {e₁, e₂, ⋯, e_m}, and each road in the road network is an unordered pair of two different nodes, recorded as: $e_{k} = (v_{i}, v_{j}) \in E, \quad k = 1, 2, 3, \dots, m, \quad i, j \in \{1, 2, 3, \dots, n\}, \quad i \neq j$ (4)

The security risk and time cost of each data channel form a normalized weight vector [25]: $w_{k} = (w_{k}^{r}, w_{k}^{c}), \quad k = 1, 2, 3, \dots, m, \quad 0 < w_{k}^{r} < 1, \quad 0 < w_{k}^{c} < 1$ (5)

A path from node v_i to node v_j is recorded as an ordered set. $p_{i \to j} = [v_{i}, v_{i + 1}, v_{i + 2}, \dots, v_{j}]$ (6)

The set of nodes that v_i can reach through only one edge is recorded as $\sigma(i)^{+}$, and the set of nodes that can reach v_i through only one edge is recorded as $\sigma(i)^{-}$. Then p_i→j is a feasible path from node v_i to node v_j if and only if $v_{i+1} \in \sigma(i)^{+}$ holds for every pair of consecutive nodes on the path. When the constraint imposed by the designated node set Z is further considered, $\cup_{m = 1}^{M} p_{v_{s} \to v_{t}}$ is a feasible solution if and only if $v_{z} \in \cup_{m = 1}^{M} p_{v_{s} \to v_{t}}, \forall z \in Z$. All feasible solutions form a set $ℙ$, and the optimization goal is to find $p^{*} \in ℙ$ that minimizes the path safety risk and time cost. This gives the following 0-1 programming model.

$\min F_{r}(x) = \sum_{m = 1}^{M} \sum_{(v_{i}, v_{j}) \in E} x_{(v_{i}, v_{j})}^{m} w_{(v_{i}, v_{j})}^{r}$ (7)

$\min F_{c}(x) = \sum_{m = 1}^{M} \sum_{(v_{i}, v_{j}) \in E} x_{(v_{i}, v_{j})}^{m} w_{(v_{i}, v_{j})}^{c}$ (8)

$\sum_{v_{j} \in \sigma(s)^{+}} x_{(v_{s}, v_{j})}^{m} - \sum_{v_{i} \in \sigma(s)^{-}} x_{(v_{i}, v_{s})}^{m} = \begin{cases} 1, & \text{if } u_{m} = 1 \\ 0, & \text{if } u_{m} = 0 \end{cases} \quad \forall m \in M$ (9)

$\sum_{v_{i} \in \sigma(t)^{-}} x_{(v_{i}, v_{t})}^{m} - \sum_{v_{j} \in \sigma(t)^{+}} x_{(v_{t}, v_{j})}^{m} = \begin{cases} 1, & \text{if } u_{m} = 1 \\ 0, & \text{if } u_{m} = 0 \end{cases} \quad \forall m \in M$ (10)

$\sum_{v_{i} \in \sigma(k)^{-}} x_{(v_{i}, v_{k})}^{m} - \sum_{v_{j} \in \sigma(k)^{+}} x_{(v_{k}, v_{j})}^{m} = 0$ (11)

$\forall v_{k} \in V \setminus \{v_{s}, v_{t}\}, m \in M$ (12)

$\sum_{m = 1}^{M} \sum_{v_{j} \in \sigma(z)^{+}} x_{(v_{z}, v_{j})}^{m} \geqslant 1, \forall z \in Z$ (13)

$u_{m} \in \{0, 1\}, \forall m \in M$ (14)

$x_{(v_{i}, v_{j})}^{m} \in \{0, 1\}, \forall v_{i}, v_{j} \in V, m \in M$ (15)

Among them, u_m indicates whether the m-th data processing path is used, and $x_{(v_{i}, v_{j})}^{m}$ indicates whether the m-th path passes through the edge (v_i, v_j). $\sum_{v_{j} \in \sigma(i)^{+}} x_{(v_{i}, v_{j})}^{m}$ represents the out-degree of node v_i on path m, and correspondingly $\sum_{v_{i} \in \sigma(j)^{-}} x_{(v_{i}, v_{j})}^{m}$ represents the in-degree of node v_j on path m. The specific meaning of each formula is as follows [26]:

Formula (7) minimizes the total security risk of the M groups of data redundancy processing routes, and formula (8) minimizes their total time cost. Formula (9) means that when the m-th group of data redundancy processing starts from node v_s (u_m = 1), the out-degree of v_s on path m is one greater than its in-degree. When the m-th group is not carried out (u_m = 0), the out-degree and in-degree of v_s on path m are equal (both zero). Formula (10) is analogous to formula (9) and restricts the in-degree and out-degree of node v_t on path m. Formulas (11) and (12) mean that, on the path of each group of data redundancy processing, the in-degree and out-degree of every node other than the start point v_s and the end point v_t are equal. Formula (13) means that every node z ∈ Z is visited by at least one of the M data processing paths. Formula (14) states that u_m is a binary variable that equals 1 if and only if the m-th group of data redundancy processing is executed. Formula (15) states that $x_{(v_{i}, v_{j})}^{m}$ is a binary variable that equals 1 if and only if the m-th route passes through the edge (v_i, v_j). The above model, denoted M1, is not complete: a solution that consists of several sub-loops rather than a complete path from v_s to v_t through all z ∈ Z is still feasible under M1. The set of all nodes visited by path m in a feasible solution of M1 is expressed as follows: $C = \{v_{i}, v_{j} \mid x_{(v_{i}, v_{j})}^{m} = 1\}, \forall m \in M$ (16)

We let S_c denote a subset of C and add the following sub-loop elimination constraint: $\sum_{v_{i} \in S_{c}} \sum_{v_{j} \notin S_{c}} x_{(v_{i}, v_{j})}^{m} \geqslant 1, \forall m \in M$ (17)

Formulas (7) to (17) constitute a complete multi-objective route optimization model with designated data output points and replenishment points.
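To make the roles of objectives (7) and (8) and the coverage constraint (13) concrete, the following Python sketch scores a candidate set of paths on a small weighted graph. The graph, the node names, and the helper names (`edge`, `is_feasible_path`, `objectives`, `covers_required`) are assumptions introduced for this illustration only, and every traversal of an edge is counted once.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = FrozenSet[str]                       # unordered pair of node names
Weights = Dict[Edge, Tuple[float, float]]   # edge -> (risk w^r, time cost w^c)


def edge(u: str, v: str) -> Edge:
    """Build the unordered pair (u, v) used as a dictionary key."""
    return frozenset((u, v))


def is_feasible_path(path: List[str], weights: Weights, start: str, end: str) -> bool:
    """A path is feasible if it runs from start to end along existing edges."""
    if not path or path[0] != start or path[-1] != end:
        return False
    return all(edge(a, b) in weights for a, b in zip(path, path[1:]))


def objectives(paths: List[List[str]], weights: Weights) -> Tuple[float, float]:
    """Return (F_r, F_c): total risk and total time cost over all paths."""
    f_r = sum(weights[edge(a, b)][0] for p in paths for a, b in zip(p, p[1:]))
    f_c = sum(weights[edge(a, b)][1] for p in paths for a, b in zip(p, p[1:]))
    return f_r, f_c


def covers_required(paths: List[List[str]], required: Set[str]) -> bool:
    """Check that every designated node in Z lies on at least one path."""
    visited = {v for p in paths for v in p}
    return required <= visited


# Hypothetical three-node example; the weights are illustrative only.
example_weights: Weights = {
    edge("a", "b"): (0.1, 0.2),
    edge("b", "c"): (0.3, 0.4),
    edge("a", "c"): (0.5, 0.1),
}
example_paths = [["a", "b", "c"]]
```

Calling `objectives(example_paths, example_weights)` returns the pair of objective values for the single path a, b, c, which a multi-objective solver would then compare against other candidate solutions.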

The model proposed in this paper is an extension of the traveling salesman problem and the shortest path problem. The traveling salesman problem generally requires that all nodes are visited once and then back to the starting point, that is, there is no loop in the path. The shortest path problem requires the shortest path between the specified node pair.

The model in this paper requires generating at most M paths between the specified node pair that together pass through the specified node set with the shortest total length, and it allows loops in the paths. For the same number of nodes, the model in this paper is more difficult to solve than the plain traveling salesman problem and the shortest path problem. When each edge of the network model G has only one weight and M is restricted to 1, the above model can be solved as follows:

First, the shortest path between every pair of nodes in Z is found, and the length of the route is recorded as d_ij, $v_{i}, v_{j} \in Z$ (18)

Then, d_ij is taken as the distance between i and j, and Z is solved by TSP to get the final solution. The above solving ideas cannot be used directly when M ≠ 1, and cannot be naturally extended to the situation where there are multiple objective functions.
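The two-step procedure for the single-weight case with M = 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: Dijkstra's algorithm supplies the pairwise distances d_ij, the visiting order of Z is found by brute-force enumeration (practical only for small Z), and the start and end points are handled as fixed endpoints of the tour. The function names and the four-node graph are hypothetical; the edge weights are not taken from Fig. 1 and were chosen only so that the two path lengths quoted for that figure (13 and 12) can be reproduced.

```python
import heapq
from itertools import permutations
from typing import Dict, List, Tuple

Graph = Dict[int, Dict[int, float]]  # node -> {neighbor: edge weight}


def dijkstra(graph: Graph, source: int) -> Dict[int, float]:
    """Shortest distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def best_order(graph: Graph, start: int, end: int,
               required: List[int]) -> Tuple[float, Tuple[int, ...]]:
    """Enumerate visiting orders of the designated nodes and keep the shortest."""
    dist = {u: dijkstra(graph, u) for u in [start, *required]}
    best = (float("inf"), ())
    for order in permutations(required):
        stops = [start, *order, end]
        total = sum(dist[a].get(b, float("inf")) for a, b in zip(stops, stops[1:]))
        if total < best[0]:
            best = (total, order)
    return best


# Hypothetical weights (assumed, not read from Fig. 1).
toy_graph: Graph = {
    1: {3: 8.0, 4: 4.0},
    2: {4: 2.0},
    3: {1: 8.0, 4: 3.0},
    4: {1: 4.0, 3: 3.0, 2: 2.0},
}
length, order = best_order(toy_graph, start=1, end=2, required=[3])
```

With these assumed weights the shortest route from node 1 through node 3 to node 2 has length 12 and revisits node 4, whereas the loop-free route 1, 3, 4, 2 has length 13.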

The multi-objective genetic algorithm operates on a set of feasible solutions and gives a set of non-inferior solutions after several iterations. It is a practical method for solving multi-objective optimization problems. In this paper, the model solving algorithm is designed on the basis of the genetic algorithm. The key issues are: 1) how to generate the initial feasible solutions; 2) how to ensure that the offspring produced by crossover and mutation remain feasible during constrained optimization; 3) how to weigh the chances of non-dominated and dominated solutions entering the next generation when selecting offspring; 4) how to set the stopping conditions of the multi-objective optimization algorithm and extract the optimal solution.
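The notion of non-inferior (non-dominated) solutions used by the multi-objective genetic algorithm can be illustrated with a short Python sketch. It assumes that each solution is summarized by its two objective values (F_r, F_c), both to be minimized; the function names and sample points are hypothetical.

```python
from typing import List, Tuple

Objectives = Tuple[float, float]  # (risk F_r, time cost F_c), both minimized


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if it is no worse in every objective and better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Objectives]) -> List[Objectives]:
    """Keep the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective values for four candidate solutions.
candidates = [(0.2, 0.9), (0.5, 0.5), (0.6, 0.6), (0.9, 0.1)]
front = non_dominated(candidates)
```

Here the candidate (0.6, 0.6) is dropped because (0.5, 0.5) is better in both objectives, while the other three represent different trade-offs between risk and time cost.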

Solving algorithm design

For the shortest path problem that must pass through a number of designated nodes, a unique path is determined once the order of the designated nodes and the path between each pair of consecutive nodes are fixed, even if a designated node is visited more than once. The algorithm in this paper is designed around this basic idea.

Initial population generation

First, the nodes z ∈ Z are randomly assigned to M sequences, where M is the upper limit on the number of paths. The m-th sequence has the form $[z_{m}^{1}, z_{m}^{2}, \dots, z_{m}^{*}]$, and $\cup_{m = 1}^{M} [z_{m}^{1}, z_{m}^{2}, \dots, z_{m}^{*}] = Z$ (19)

That is, every z ∈ Z is visited by one of the M paths. Then, for every m ∈ M, a path from each $z_{m}^{i}$ to the adjacent $z_{m}^{i + 1}$ is searched until the paths between all adjacent nodes are determined. Finally, a path from the designated start point v_s to $z_{m}^{1}$ and a path from $z_{m}^{*}$ to the designated replenishment point v_t are searched. In this way, M paths from v_s to v_t that together pass through all z ∈ Z are generated: $[v_{s}, \dots, z_{m}^{1}, \dots, z_{m}^{2}, \dots, z_{m}^{*}, \dots, v_{t}], \forall m \in M$ (20)

These paths constitute a complete data processing plan, that is, a feasible solution. Repeating this procedure several times yields a number of initial feasible solutions, which form the initial population of the genetic algorithm.
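The initial population generation described above can be sketched in Python as follows. The sketch makes several assumptions that the text does not fix: the graph is given as an adjacency list, the path between consecutive nodes is found with breadth-first search, the order within each sequence is shuffled at random, and empty sequences are skipped (corresponding to u_m = 0). All function names are hypothetical.

```python
import random
from collections import deque
from typing import Dict, List

Adjacency = Dict[str, List[str]]


def bfs_path(adj: Adjacency, src: str, dst: str) -> List[str]:
    """Loop-free path from src to dst with the fewest edges."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        raise ValueError(f"no path from {src} to {dst}")
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def random_solution(adj: Adjacency, start: str, end: str, required: List[str],
                    max_paths: int, rng: random.Random) -> List[List[str]]:
    """Randomly split Z into at most max_paths sequences and connect each one."""
    groups: List[List[str]] = [[] for _ in range(max_paths)]
    for z in required:
        rng.choice(groups).append(z)      # random assignment of z to a sequence
    solution = []
    for group in groups:
        if not group:
            continue                      # unused path (u_m = 0)
        rng.shuffle(group)                # random visiting order
        stops = [start, *group, end]
        path = [start]
        for a, b in zip(stops, stops[1:]):
            path.extend(bfs_path(adj, a, b)[1:])  # skip the repeated joint node
        solution.append(path)
    return solution


def initial_population(adj: Adjacency, start: str, end: str, required: List[str],
                       max_paths: int, size: int, seed: int = 0) -> List[List[List[str]]]:
    """Repeat the random construction to obtain the initial population."""
    rng = random.Random(seed)
    return [random_solution(adj, start, end, required, max_paths, rng)
            for _ in range(size)]
```

Because consecutive segments are searched independently, a generated path may pass through the same node more than once, which is the situation the loop analysis in this section addresses.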

The above algorithm ensures that there is no loop in the path from $z_{m}^{i}$ to $z_{m}^{i + 1}$. However, the path from $z_{m}^{i}$ to $z_{m}^{i + 1}$ may itself contain nodes z ∈ Z, which results in a loop in the entire path. The appearance of loops is sometimes beneficial for shortening the path length. A single-weight graph model is taken as an example. We assume that 3 in Fig. 1 is a node that must be passed, and 1 and 2 are the start and end points, respectively. If the generated path does not contain a loop, the solution is [1, 3, 4, 2] in the left figure, and the length is 13. If there is a loop in the generated path, the solution is [1, 4, 3, 4, 2] in the right figure, and the length is 12. Therefore, for the shortest path between a node pair via a specified node set, the existence of certain types of loops can help shorten the total path length.

Fig. 1

Circuit that shortens the path length.

The loops in the generated path have the following definitions of “beneficial loop” and “non-beneficial loop”. Among them, beneficial loops are loops that may reduce path safety risks or time spent, and non-beneficial loops are loops that are not beneficial to reducing path safety risks or time spent. The specific definition is as follows.

Definition 1: Beneficial loop: a loop in path p, $p_{l} = [v_{k*}, v_{k1}, v_{k2}, \dots, v_{k*}]$ (21), is a beneficial loop if and only if $\exists v_{ki} \in Z, i \neq *$.

Definition 2: Non-beneficial loop: a loop in path p, $p_{l} = [v_{k*}, v_{k1}, v_{k2}, \dots, v_{k*}]$ (22), is a non-beneficial loop if and only if $v_{ki} \notin Z, \forall i \neq *$.

At the same time, the necessity of a loop $p_{l} = [v_{k*}, v_{k1}, v_{k2}, \dots, v_{k*}]$ in path p is analyzed as follows. Depending on the loop endpoint (the start point, which is also the end point) and the other nodes in the loop, there are four situations.

1) There are only non-designated nodes in the entire loop: $v_{ki} \notin Z, \forall v_{ki} \in p_{l}$ (23)

Clearing the loop necessarily shortens the path, and it does not make the path an infeasible solution. This type of loop is unnecessary.

2) The endpoint of the loop is a designated node, and all other nodes in the loop are non-designated nodes: $v_{k*} \in Z, v_{ki} \notin Z, \forall i \neq *$ (24)

Clearing the part of the loop other than the endpoint necessarily shortens the path, and it does not make the path an infeasible solution. Therefore, the part of the loop other than the endpoint is unnecessary.

3) There is a designated node in the loop, and the same designated node also exists outside the loop: $\exists v_{ki} \in Z, i \neq *, \quad v_{ki} \in p \setminus \{v \mid v \in p_{l}\}$ (25)

Such a loop may help shorten the total length of the path.

4) There is a designated node in the loop, and the node does not exist outside the loop: $\exists v_{ki} \in Z, i \neq *, \quad v_{ki} \notin p \setminus \{v \mid v \in p_{l}\}$ (26)

Such a loop is necessary, and deleting it makes the solution infeasible.

In summary, when all the nodes of the loop except the endpoint are non-designated nodes, the loop can be cleared. There is no restriction on the endpoint itself, which can be a designated node or a non-designated node.

Therefore, the generated initial feasible solution does not have non-beneficial loops. After crossover and mutation operations are performed, the newly generated path needs to be cleared of non-beneficial loops.
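The clearing rule summarized above (a loop can be removed when none of its nodes other than the endpoint is designated) can be sketched in Python as follows. The function name and the example paths are hypothetical; a loop is taken to be the segment between two consecutive occurrences of the same node.

```python
from typing import List, Set


def remove_non_beneficial_loops(path: List[int], required: Set[int]) -> List[int]:
    """Repeatedly cut loops whose interior contains no designated node."""
    path = list(path)
    changed = True
    while changed:
        changed = False
        last_seen = {}  # node -> index of its most recent occurrence
        for j, node in enumerate(path):
            i = last_seen.get(node)
            if i is not None and not any(v in required for v in path[i + 1:j]):
                # Interior of the loop path[i..j] has no node of Z:
                # keep one copy of the endpoint and drop the rest.
                del path[i + 1:j + 1]
                changed = True
                break
            last_seen[node] = j
    return path


designated = {3}
kept = remove_non_beneficial_loops([1, 4, 3, 4, 2], designated)
cleared = remove_non_beneficial_loops([1, 4, 5, 4, 3, 2], designated)
```

The first path keeps its loop 4, 3, 4 because the loop contains the designated node 3, whereas the loop 4, 5, 4 in the second path is removed. Checking only the most recent earlier occurrence is sufficient here, since any longer loop ending at the same position contains the shorter one.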

4 Model building

The data processing path of the multi-objective optimization algorithm constructed in this paper is shown in Fig. 2. The first part is the data input layer, the second part is a number of available transit nodes, and the third part is a number of demand nodes.

Fig. 2

Data processing network structure.

The traditional language model is limited by the model order n (usually an integer from 1 to 5). It can only condition on the preceding n-1 words within a limited window. Therefore, it may fail to capture enough contextual information in a word sequence of unlimited length, which leads to unsatisfactory language model performance. At the same time, whenever the order of the model increases by one, the computation of the probability of the entire sequence increases exponentially, which can easily cause a combinatorial explosion. To overcome these shortcomings of traditional language models, the first large-scale neural network language model was proposed. The model captures the contextual information of words by learning distributed representations of words. When the context window of the model grows, the number of parameters of the entire model increases only linearly. The structure of the neural network language model is shown in Fig. 3:

Fig. 3

The structure of the neural network language model.

Unlike an ordinary RNN, the LSTM is a special unit with long short-term memory. Ordinary RNNs are extremely sensitive to short-term input but cannot maintain long-term memory. By introducing a gating mechanism, the LSTM adds an internal state that can store long-distance information to the hidden layer unit of the ordinary RNN, thereby solving the long-distance dependency problem of the ordinary RNN. The structure of the improved RNN with internal state c is shown in Fig. 4:

Fig. 4

RNN with improved structure.
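The gating mechanism and the internal state c can be illustrated with a single LSTM step written in plain Python. This is a sketch of the standard LSTM formulation, which is assumed here because the exact variant behind Fig. 4 is not specified in the text. The parameter layout (a dictionary of weight matrices and bias vectors per gate) and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Gate = Tuple[List[Vector], Vector]  # (weight matrix rows, bias vector)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: Vector, h: Vector, c: Vector,
              params: Dict[str, Gate]) -> Tuple[Vector, Vector]:
    """One LSTM step: returns the new hidden state h and internal state c."""
    z = x + h  # concatenation [x; h] fed to every gate

    def affine(name: str) -> Vector:
        weights, bias = params[name]
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(weights, bias)]

    i = [sigmoid(a) for a in affine("input")]        # input gate
    f = [sigmoid(a) for a in affine("forget")]       # forget gate
    o = [sigmoid(a) for a in affine("output")]       # output gate
    g = [math.tanh(a) for a in affine("candidate")]  # candidate state

    # The internal state keeps part of the old memory and adds new content.
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new
```

When the forget gate is close to 1 and the input gate is close to 0, the internal state is carried forward almost unchanged, which is how long-distance information can be retained.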

First, when the source-to-target translation model is trained with the method proposed in this paper, a pre-trained target-to-source translation model is needed to translate the target language sentences in the monolingual corpus. This process resembles back-translation, in which a monolingual corpus is translated to generate a pseudo-parallel corpus that enriches the training data. The difference is that related work adds the generated pseudo-parallel corpus directly to the original parallel corpus for training, whereas the method in this paper uses the generated sentences to calculate the probability of each component in the law of total probability. Using the monolingual corpus in this way reduces the negative effect of a low-quality pseudo-parallel corpus on translation model training. Because the law of total probability is an inherent property that holds for any sentence pair, the quality of the sentences produced by back-translation has much less influence on the quality of the translation model. Another way of using sentences sampled from the reverse translation model is monolingual corpus reconstruction. In that approach, the source-to-target and target-to-source translation models are used to reconstruct the monolingual corpus of the target language, and the two models are trained jointly. Reconstruction-based methods assume that the translation models in both directions benefit from the iterative process of joint training. In practice, however, this training process is difficult to control because of error propagation: errors generated in source-to-target translation are propagated to target-to-source translation. Compared with reconstruction-based methods, the method proposed in this paper models the neural machine translation model directly by exploiting its probabilistic properties. It samples from the reverse translation model only once, so there is no error propagation problem. Intuitively, the method could also improve the reverse translation model through iterative training of the two models. However, when a machine translation model is trained in practice, samples can be drawn from any reverse translation model. Therefore, iterative sampling from the two models is not necessary in the method proposed in this paper.
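One way to turn the probabilistic property described above into a training signal is sketched below. The text does not give the exact form of the regularization term, so the following is an assumption: the marginal P(y) = Σ_x P(x) P(y|x) is estimated by importance sampling with source sentences x_k drawn once from the reverse translation model Q(x|y), and the squared difference between log P(y) from a target-side language model and the log of this estimate is penalized. All function and argument names are hypothetical, and the inputs are log-probabilities that would come from the respective models.

```python
import math
from typing import List


def log_marginal_estimate(log_p_x: List[float],
                          log_p_y_given_x: List[float],
                          log_q_x_given_y: List[float]) -> float:
    """log of (1/K) * sum_k P(x_k) * P(y | x_k) / Q(x_k | y), computed stably."""
    terms = [a + b - c for a, b, c in zip(log_p_x, log_p_y_given_x, log_q_x_given_y)]
    m = max(terms)
    log_sum = m + math.log(sum(math.exp(t - m) for t in terms))
    return log_sum - math.log(len(terms))


def marginal_regularizer(log_p_y: float,
                         log_p_x: List[float],
                         log_p_y_given_x: List[float],
                         log_q_x_given_y: List[float]) -> float:
    """Squared gap between the language-model marginal and its estimate."""
    estimate = log_marginal_estimate(log_p_x, log_p_y_given_x, log_q_x_given_y)
    return (log_p_y - estimate) ** 2


# Toy check with made-up probabilities: when Q(x|y) equals the exact posterior
# P(x) P(y|x) / P(y), every sample reproduces P(y) and the penalty vanishes.
penalty = marginal_regularizer(
    log_p_y=math.log(0.2),
    log_p_x=[math.log(0.5), math.log(0.5)],
    log_p_y_given_x=[math.log(0.3), math.log(0.1)],
    log_q_x_given_y=[math.log(0.75), math.log(0.25)],
)
```

In this sketch only `log_p_y_given_x` depends on the source-to-target model being trained, so the sampled sentences can be generated once and reused, in line with the single sampling pass described above.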

The language model combination method first trains an independent language model on the monolingual corpus and then incorporates the trained language model into the training of the neural machine translation model. Compared with maximum likelihood training, this method requires almost no additional training time or machine memory beyond training the language model. However, its improvement to the machine translation model is very limited, because it does not fundamentally solve the shortage of parallel training data. Methods based on data augmentation differ in their training objectives and strategies, but all of them rely on a data augmentation step: a source-to-target translation model translates the source-side monolingual corpus, or a target-to-source translation model translates the target-side monolingual corpus. Because of this step, such methods usually require additional training time and machine memory. The consumption depends mainly on the sample size, that is, the number of translations generated for each monolingual sentence. Compared with iterative sampling and training of the translation model, the method proposed in this paper translates each monolingual sentence in the target language only once. In summary, data augmentation is a necessary operation for improving translation quality in existing semi-supervised neural machine translation research, so longer training time and larger machine memory are the price of a better model. Overall, the method proposed in this paper saves considerable training time compared with iterative training methods.

5 Model performance analysis

Bilingual corpora at the text level are usually scarce. Therefore, this paper proposes to also use non-text bilingual corpora to train the completion model. Tables 1 to 3 report the results for bilingual corpora of different scales. The completion model achieves a higher average score than the benchmark model at every scale (39.33 vs. 38.49, 45.02 vs. 43.53, and 45.66 vs. 45.14), although in Table 3 it scores lower on NIST03. The corresponding statistical graphs are shown in Fig. 5 to Fig. 7.

Table 1
Statistical table of 900K text bilingualism

	Benchmark model	Completion
NIST06	37.75	38.49
NIST02	40.94	41.82
NIST03	40.36	41.72
NIST04	42.52	42.91
NIST05	39.29	39.95
NIST08	30.08	31.07
AVG	38.49	39.33

Table 2

Statistical table of 10 million bilingualism + 900K text bilingualism

	Benchmark model	Completion
NIST06	43.31	44.76
NIST02	44.93	45.49
NIST03	45.83	48.24
NIST04	46.04	46.76
NIST05	43.90	46.02
NIST08	37.51	38.81
AVG	43.53	45.02

Table 3

Statistical table for the 130 million bilingual corpus + 900K text bilingual corpus

	Benchmark model	Completion
NIST06	45.38	47.02
NIST02	45.36	45.79
NIST03	47.69	46.11
NIST04	46.90	46.95
NIST05	45.65	47.02
NIST08	39.83	41.07
AVG	45.14	45.66

Fig. 5

Statistical diagram for the 900K text bilingual corpus.

Fig. 6

Statistical diagram for the 10 million bilingual corpus + 900K text bilingual corpus.

Fig. 7

Statistical diagram for the 130 million bilingual corpus + 900K text bilingual corpus.
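The per-data-set differences behind Tables 1 to 3 can be recomputed with a short script. The sketch below copies the scores from the tables (AVG rows omitted); the function names are illustrative, and the means recomputed from the individual rows need not coincide exactly with the printed AVG rows.

```python
from statistics import mean
from typing import Dict, List, Tuple

Scores = Dict[str, Tuple[float, float]]  # data set -> (benchmark, completion)

# Values copied from Tables 1 to 3 (AVG rows omitted).
TABLES: Dict[str, Scores] = {
    "Table 1": {
        "NIST06": (37.75, 38.49), "NIST02": (40.94, 41.82),
        "NIST03": (40.36, 41.72), "NIST04": (42.52, 42.91),
        "NIST05": (39.29, 39.95), "NIST08": (30.08, 31.07),
    },
    "Table 2": {
        "NIST06": (43.31, 44.76), "NIST02": (44.93, 45.49),
        "NIST03": (45.83, 48.24), "NIST04": (46.04, 46.76),
        "NIST05": (43.90, 46.02), "NIST08": (37.51, 38.81),
    },
    "Table 3": {
        "NIST06": (45.38, 47.02), "NIST02": (45.36, 45.79),
        "NIST03": (47.69, 46.11), "NIST04": (46.90, 46.95),
        "NIST05": (45.65, 47.02), "NIST08": (39.83, 41.07),
    },
}


def gains(scores: Scores) -> Dict[str, float]:
    """Completion score minus benchmark score for each data set."""
    return {name: round(comp - base, 2) for name, (base, comp) in scores.items()}


def mean_gain(scores: Scores) -> float:
    """Average of the per-data-set gains."""
    return mean(gains(scores).values())


def regressions(scores: Scores) -> List[str]:
    """Data sets on which the completion model scores lower."""
    return [name for name, delta in gains(scores).items() if delta < 0]


for label, scores in TABLES.items():
    print(label, f"mean gain {mean_gain(scores):.2f}", "lower on:", regressions(scores))
```

Looking at per-data-set gains rather than only the averages makes it easy to see where the completion model does not help.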

The benchmark model and the completion model are trained differently. For the benchmark model, the large-scale bilingual corpus and the small text bilingual corpus come from different domains, so we train it as a domain model, that is, by pre-training followed by fine-tuning. The large-scale bilingual corpus, however, contains no completion tags or completed words, so pre-training followed by fine-tuning would be unfair to the completion model. We therefore train the completion model directly on a mixture of the completed text corpus and the large-scale bilingual corpus. Table 4 and Fig. 8 give the experimental results of the two training methods in the 10 million bilingual corpus + 900K text bilingual corpus setting. The two models respond differently to the training method: the benchmark model achieves its best result with pre-training and fine-tuning (43.31 versus 43.14 BLEU), whereas the completion model achieves its best result with direct mixed-corpus training (44.76 versus 43.23 BLEU).

Table 4

Comparison of the different training methods of the two models on the development set

	Training method	BLEU
Benchmark model	Pre-training + fine-tuning	43.31
	Mixed training	43.14
Completion model	Pre-training + fine-tuning	43.23
	Mixed training	44.76

Fig. 8

Statistical diagram of the comparison results of the different training methods of the two models on the development set.
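The difference between the two training methods compared in Table 4 lies in the order in which the model sees the two corpora. The following sketch only builds the example streams and does not train a model; the function names, epoch counts, and the shuffling seed are assumptions made for illustration, and for the completion model the small corpus would be the completed (tagged) text corpus.

```python
import random
from typing import List, Sequence


def pretrain_then_finetune(
    large: Sequence[str],
    small: Sequence[str],
    pretrain_epochs: int,
    finetune_epochs: int,
) -> List[str]:
    """Pre-training + fine-tuning: all passes over the large corpus come
    first, followed by passes over the small in-domain corpus only."""
    stream: List[str] = []
    for _ in range(pretrain_epochs):
        stream.extend(large)
    for _ in range(finetune_epochs):
        stream.extend(small)
    return stream


def mixed_training(
    large: Sequence[str],
    small: Sequence[str],
    epochs: int,
    seed: int = 0,
) -> List[str]:
    """Direct mixed training: every epoch shuffles both corpora together,
    so examples from the small corpus appear throughout training."""
    rng = random.Random(seed)
    stream: List[str] = []
    for _ in range(epochs):
        epoch = list(large) + list(small)
        rng.shuffle(epoch)
        stream.extend(epoch)
    return stream


large_corpus = ["g1", "g2", "g3"]   # stands in for the large-scale bilingual corpus
small_corpus = ["t1", "t2"]         # stands in for the text bilingual corpus
print(pretrain_then_finetune(large_corpus, small_corpus, 2, 1))
print(mixed_training(large_corpus, small_corpus, epochs=2))
```

In the first schedule the small corpus is seen only at the end of training, which is the situation the text describes as unfair to the completion model, since the large corpus carries no completion tags.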

6 Conclusion

Since neural machine translation was proposed, its strong performance has allowed it to gradually replace the traditional phrase-based statistical machine translation method. End-to-end neural machine translation dispenses with the cumbersome structure and complex feature design of statistical machine translation and instead passes the parallel corpus directly to a neural network to train a complete translation system. However, while neural machine translation has brought many opportunities to industry and academia, it also faces challenges. To further improve the accuracy of the translation model and enhance the decoder’s ability to obtain context information from the encoder, we propose a new calculation method that jointly considers the context vectors at different time steps. Moreover, this paper proposes a completion method to address the incompleteness, inconsistency, and ambiguity of sentence-level representations. This method first extracts information from each sentence in the text and then adds the extracted information to the source sentence by tagging. Experimental results show that the completion method proposed in this paper improves the text translation system, raising the average score in all three data settings.

Acknowledgment

Key project of the Ministry of Education in China, 2017. “Research on Blended Learning in the Context of the Internet +” (EIJYB2017_026).
