Abstract
Due to the complexity of English machine translation technology and its broad application prospects, many experts and scholars have invested more energy to analyze it. In view of the complex and changeable English forms, the large difference between Chinese and English word order, and insufficient Chinese-English parallel corpus resources, this paper uses deep learning to complete the conversion between Chinese and English. The research focus of this paper is how to use language pairs with rich parallel corpus resources to improve the performance of Chinese-English neural machine translation, that is, to use multi-task learning to train neural machine translation models. Moreover, this research proposes a low-resource neural machine translation method based on weight sharing, which uses the weight-sharing method to improve the performance of Chinese-English low-resource neural machine translation. In addition, this study designs a control experiment to analyze the effectiveness of this study model. The research results show that the model proposed in this paper has a certain effect.
Introduction
With the continuous cooperation of countries around the world, English has become one of the most widely used languages in the world today. At present, almost one in eight people speak English, and one in four people are familiar with English. English is required in various fields such as international political exchanges, military cooperation, economics, trade, science, technology and culture, and the world’s leading scientific research is generally published in English. All in all, due to the increasing competitiveness of today’s society, learning English well is urgent for everyone [1].
Language is a communication tool. People communicate through language, and written words are the images or symbols that record people’s thoughts and carry the language. Different countries and ethnic groups have different languages, which makes communication inconvenient. However, machine translation can perform language conversion and thereby facilitate communication. Deep learning has been successfully applied in different areas of natural language processing, and research on neural machine translation methods based on deep learning has also developed rapidly. Recently, neural machine translation methods have achieved results comparable to statistical machine translation methods [2], so researchers have gradually begun to use neural networks to improve machine translation. Although many methods and technologies can perform machine translation, they all face the same problem: no single method can fully describe the information in the source language when translating a given source sentence, so the representation of the source sentence inevitably contains errors. Paraphrasing is very common in natural language. It means using another expression for a phrase, a sentence or even an entire article while leaving the meaning unchanged. A source language sentence can therefore yield one or more paraphrases with the same semantics [3]. Paraphrasing can modify and supplement the results of machine translation, compensate for the errors caused by the above problem, and make the translation describe the source language more fully. In addition, the system fusion method is also a solution to the above problem.
The system fusion method takes multiple different systems and fuses the outputs of the different machine translation methods to produce an output that is superior to the original machine translation results, thereby improving translation quality [4].
Related work
After a detailed study of machine learning, the literature [5] identified two important problems in neural network research. First, the basic perceptron cannot handle the XOR problem. Second, computers at the time did not have enough capacity to handle the work required by large neural networks. As a result, research on neural networks slowed down, and it was not until Hinton’s research on deep learning that neural networks once again attracted wide attention. Machine translation (MT), also known as automatic translation, is the process of using a computer to translate one natural language (the source language) into another natural language (the target language) while retaining the original semantics. Machine translation is not only a branch of computational linguistics, but also an important research direction of natural language processing. Moreover, it is one of the important goals of research in the field of artificial intelligence and has important scientific research value. People have studied machine translation since the birth of the computer and continue to do so today. Over several decades of ups and downs, machine translation has gone through a long process of initiation, frustration, recovery and a new period of growth. At the same time, machine learning has been continuously studied by the academic community, so machine translation methods have also evolved through several stages. They have moved from the initial direct translation method to the rule-based translation method and then to the corpus-based translation method. Memory-based translation methods and example-based translation methods are now gradually being replaced by statistical translation methods and neural network-based translation methods.
The literature [6] proposed a neural network machine translation system with Encoder-Decoder architecture. The literature [7] proposed to add Attention mechanism to the RNN model for image classification, which makes the attention mechanism really popular. After further research, the literature [8] added attention to the task of machine translation in the Encoder-Decoder mechanism, which first applies the attention mechanism to the research of natural language processing. This work laid the foundation for many subsequent works on neural network systems, so that similar attention mechanism RNN models began to be applied to various NLP tasks. The literature [9] explained how Attention is extended in RNN, which plays a great role in the subsequent application of various Attention-based NLP. Subsequently, people began to explore convolutional neural networks (CNN) in deep learning. The literature [10] proposed to use Attention mechanism in CNN, and successfully completed the machine translation method based on convolutional neural network. Although the research time of machine translation combined with artificial neural network is relatively short, it has completely achieved translation results that are competitive with statistical translation methods.
Because manual evaluation is time-consuming, expensive, and difficult to reproduce, automatic evaluation of translation quality is crucial to the development of a high-quality English translation training system [11]. With reliable automatic evaluation indicators, the English translation training system can be adjusted according to those indicators. Compared with machine correction, manual correction is a time-consuming and labor-intensive process. When corrections must be fast, numerous, and of high quality, manual correction usually cannot meet the requirements. Progress in English translation training systems stems from evaluation. Developers benefit from inexpensive automatic evaluation that is fast and correlates highly with manual evaluation. The better the English translation training system, the more closely its evaluation output resembles the results of human correction. English translation training systems appeared relatively late and were first proposed only in the late 1990s [12]. Researchers incorporated natural language processing technology into the English translation training system to evaluate translations automatically, including internal system measurement and evaluation of the output. The literature [13] used n-dimensional vectors to measure the candidate output generated in natural language, while Ringger et al. used n-dimensional vectors in 2001 to compare the output of English translation training systems. The literature [14] uses the distance between the reference translation sentence vector and the output sentence vector to evaluate the translation quality of the English translation training system. Translation and writing are both subjective questions in the English test, so there is a certain similarity between the two.
The English composition training system mainly analyzes the words, the grammar, and the relevance to the topic of the composition to be tested. English composition training systems are currently close to maturity, whereas automatic correction systems in the field of translation are still in their infancy, and there are certain differences between the two. For example, a composition is written according to a topic, so the range of acceptable answers is wide and there is no standard answer. A translation, however, is based on a specified sentence or paragraph, has a standard answer, and must be closely related to the text to be translated. The famous American translation theorist Eugene Nida once said: “Translation means translating meaning”, so the key to automatic translation scoring lies in judging how semantically similar the student’s translation is to the standard translation. The evaluation algorithms developed abroad are mainly aimed at machine translation systems. One of the more representative algorithms is BLEU (Bilingual Evaluation Understudy), proposed in the literature [15]. This method compares the translation to be tested with the standard translation and uses the n-gram overlap between them to generate the translation score. The literature [16] calculated the semantic similarity between the target translation and the standard translation based on the vector space model (VSM) to judge the quality of the target translation. Since the VSM method cannot capture the semantic similarity between synonyms, the correction results must be adjusted manually, which introduces considerable subjectivity. According to the translation scoring criteria for Chinese college students’ English exams, a comprehensive evaluation of a translation depends not only on its text content, but also on the use of words and grammar and the completeness of semantic expression [19].
The current English translation training system does not consider the impact of word grammatical errors on the translation score when analyzing the translation, and it cannot accurately obtain the semantic similarity between the translation to be tested and the standard translation.
Recurrent neural network
The neural network is composed of many neural units superimposed according to a certain rule. The construction of the internal structure of the neuron is inspired by the biological neuron, which abstracts the behavior of the biological neuron when it is stimulated by the outside world into a mathematical language [20–22].
Figures 1 and 2 show the neuron structure and the neural network structure, respectively. In Fig. 1, $X = (x_1, x_2, \cdots, x_n)$ represents the input of the neural unit, $W = (w_1, w_2, \cdots, w_n)$ represents the weight of each input, $\theta$ represents the threshold of the neural unit, which can be set, and $y$ represents the output of the entire neural unit, which is expressed by a mathematical formula as:
$$y = f\left(\sum_{i=1}^{n} w_i x_i - \theta\right)$$
where $f(\cdot)$ is the activation function.

Neuron structure.

Neural network structure.
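As a minimal sketch of the neural unit in Fig. 1, the following Python function computes the weighted sum of the inputs, subtracts the threshold, and applies an activation. The step activation and the AND-gate weights are assumptions chosen for illustration; the text does not fix a particular activation function.

```python
def neuron_output(x, w, theta):
    """Threshold neuron: output 1 when the weighted input sum reaches theta."""
    net = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if net >= 0 else 0  # step activation (assumed for illustration)


# With weights (1, 1) and threshold 1.5 the unit behaves like a logical AND.
and_gate = [neuron_output([a, b], [1.0, 1.0], 1.5) for a in (0, 1) for b in (0, 1)]
print(and_gate)
```

Changing only the threshold (for example to 0.5) turns the same unit into a logical OR, which shows how the threshold controls when the unit fires.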
In the neural network introduced earlier, the output at the current time is only related to the input at the current time. Although this structure makes the neural network model easy to train, it weakens the learning ability of the model to a certain extent. In many realistic tasks, the input at the current time and the output over a period of time in the past determine the output at the current time. The recurrent neural network can not only receive its own information at the current moment, but also the state information of other neural units in the past period of time. Therefore, the recurrent neural network has a “memory” function. Compared with feedforward neural networks, the characteristics of recurrent neural networks are closer to the structure of biological neurons, and recurrent neural networks can handle the task of variable-length sequences. The parameters of the recurrent neural network can be updated by gradient descent methods, such as full gradient descent method, small batch gradient descent method, etc. Figure 3 shows the structure of a single recurrent neural unit:

Structure of a single neural unit.
As shown in Fig. 3, $x_t$ represents the input of the recurrent neural network at time $t$, $h_{t-1}$ represents the output of the recurrent neural network at time $t-1$, and $h_t$ represents the output at the current time. Therefore, for the input sequence $X = (x_1, x_2, \cdots, x_t, \cdots, x_T)$, the output of this neural unit is:
$$h_t = f(h_{t-1}, x_t)$$
Among them, $f(\cdot)$ represents the activation function. At the first time step the neural unit has no previous output, so the value of $h_0$ can be set by random initialization. Figure 4 shows a recurrent neural network unrolled over time. The network has only a single intermediate layer:
As shown in Fig. 4, $W$ is the input parameter matrix, $U$ is the state parameter matrix, and $V$ is the output parameter matrix. If the input of the network at time $t$ is $x_t$, the output of the neural unit at the current time is related not only to the input $x_t$ at the current time but also to the output $h_{t-1}$ at the previous time:
$$h_t = f(U h_{t-1} + W x_t + b)$$
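As a minimal sketch of this dependence on $x_t$ and $h_{t-1}$, the following pure-Python loop applies one recurrent unit to a sequence of any length. The choice of tanh as the activation, the zero initial state, and all numeric values are assumptions made for illustration.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(xs, W, U, b, h0):
    """Return hidden states h_1..h_T with h_t = tanh(U h_{t-1} + W x_t + b)."""
    h, states = h0, []
    for x in xs:
        z = [u + w + bi for u, w, bi in zip(matvec(U, h), matvec(W, x), b)]
        h = [math.tanh(zi) for zi in z]
        states.append(h)
    return states


W = [[0.5, -0.3], [0.1, 0.8]]   # input parameter matrix (toy values)
U = [[0.2, 0.0], [0.0, 0.2]]    # state parameter matrix (toy values)
b = [0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_forward(xs, W, U, b, h0=[0.0, 0.0])
print(states[-1])
```

Because the same `W`, `U` and `b` are reused at every step, the function accepts sequences of any length without any change to the parameters.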

Recurrent neural network expanded according to time series.
In the above formula, $b$ represents the offset of the neural unit and $f(\cdot)$ represents the activation function, which is generally non-linear. Therefore, the output of the recurrent neural network is:
The loss function of the network at time $t$ is defined as:
In the formula, $L(\cdot)$ is a differentiable function. Then the total loss function is:
According to the chain rule, we can obtain:
Among them, $\delta_{t,k}$ is the derivative of the loss at time $t$ with respect to the net input $z_k$ of the hidden neuron at step $k$:
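A standard textbook form of these backpropagation-through-time relations, written with the symbols defined above, is sketched below. It assumes the net input $z_k = U h_{k-1} + W x_k + b$ with $h_k = f(z_k)$, writes the loss at time $t$ as $\mathcal{L}_t$, and leaves the boundary term $\delta_{t,t}$ abstract.

$$
\begin{aligned}
\mathcal{L} &= \sum_{t=1}^{T} \mathcal{L}_t,\\
\frac{\partial \mathcal{L}}{\partial U} &= \sum_{t=1}^{T}\sum_{k=1}^{t} \delta_{t,k}\, h_{k-1}^{\top},\\
\delta_{t,k} &= \frac{\partial \mathcal{L}_t}{\partial z_k}
 = \operatorname{diag}\big(f'(z_k)\big)\, U^{\top} \delta_{t,k+1}
 = \Big(\prod_{i=k}^{t-1} \operatorname{diag}\big(f'(z_i)\big)\, U^{\top}\Big)\, \delta_{t,t}.
\end{aligned}
$$

The product in the last line contains $t-k$ factors, which is why its magnitude grows or shrinks geometrically with the distance $t-k$.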
The recurrent neural network can process variable-length sequences and has a “memory” function, that is, it can capture the dependence between words in long sentences. However, the problems of “gradient explosion” and “gradient disappearance” (exploding and vanishing gradients) often cause the network to perform poorly on specific tasks. If $\beta = \lVert \operatorname{diag}(f'(z_i))\, U^{\top} \rVert$, the magnitude of the value in parentheses in the above formula is approximately $\beta^{t-k}$. When $\beta > 1$ and $t-k \to \infty$, $\beta^{t-k} \to \infty$, which makes the recurrent neural network model unstable. This is the “gradient explosion” problem. When $\beta < 1$ and $t-k \to \infty$, $\beta^{t-k} \to 0$, so long-range error signals no longer update the parameters of the recurrent neural network model. This is the “gradient disappearance” problem. Therefore, these two problems mean that the recurrent neural network can in practice only capture the relationship between words that are close to each other.
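The effect of $\beta^{t-k}$ can be illustrated numerically. The sketch below treats $\beta$ as a single scalar, which is a simplifying assumption; in the network it is a matrix norm that may differ from step to step.

```python
def gradient_factor(beta, gap):
    """Magnitude beta**(t-k) that scales an error signal passed back over `gap` steps."""
    return beta ** gap


for beta in (0.9, 1.0, 1.1):
    factors = [gradient_factor(beta, gap) for gap in (1, 10, 100)]
    print(beta, factors)
```

Even a modest deviation of $\beta$ from 1 changes the factor by several orders of magnitude over 100 steps, which is why only short-range dependencies are learned reliably.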
The problems of “gradient explosion” and “gradient disappearance” cause the recurrent neural network to learn only short-term dependencies and prevent it from learning long-term dependencies. Therefore, the long short-term memory (LSTM) network was proposed to solve this problem. The essential reason for these two problems is that the recurrent neural network is very deep in the time dimension. When the activation function is non-linear, an excessively large derivative value causes instability, while an excessively small derivative value prevents the model parameters from being updated. The LSTM network transforms the internal structure of the recurrent neural network and changes the non-linear dependence on the recurrent edge into a linear dependence, so that its derivative can be set to 1 and the “gradient explosion” and “gradient disappearance” problems do not arise. However, a purely linear recurrent connection would reduce the ability of the network to fit non-linear functions and therefore reduce the performance of the model. To solve this problem, a “gate” structure is added inside the LSTM unit: a “forget gate”, an “input gate” and an “output gate” restore the non-linear capacity of the network. Through this transformation, the LSTM network can largely avoid the “gradient explosion” and “gradient disappearance” problems while maintaining the ability of the neural unit to fit non-linear functions. Therefore, LSTM networks are widely used in the field of natural language processing.
Three “gate” structures are designed in the long short-term memory (LSTM) network, namely the forget gate, the input gate, and the output gate, and they are used to improve the nonlinear capability of the network.
The role of the forget gate is to discard useless information in the process of information transmission. The input of this gate is the input $x_t$ at the current moment, the output $h_{t-1}$ of the neural unit at the previous moment and the state $c_{t-1}$ of the neural unit at the previous moment. The output is a value between 0 and 1, where 0 means that the information is completely discarded and 1 means that the information is completely retained, and this value is applied to the neural unit state $c_{t-1}$.
In the formula, $\sigma(\cdot)$ is the nonlinear activation function, $W_f$ is the weight matrix of the forget gate, and $b_f$ is the offset.
The role of the input gate is to determine the updated information, that is, how to add the information of $h_{t-1}$ and $x_t$ to $c_{t-1}$ so as to pass the state information to the next neural unit. The input gate is composed of two parts, each a linear transformation of the input $x_t$ at the current time and the output $h_{t-1}$ of the neural unit at the previous time. The results of the two parts are then combined and passed to $c_{t-1}$.
Among them, $\sigma(\cdot)$ is the nonlinear activation function, $W_i$ and $W_c$ are the weight matrices of the input gate, and $b_i$ and $b_c$ are the biases of the input gate.
The function of the output gate is to determine the output value of the entire long short-term memory (LSTM) network. This value is determined by the state information $c_t$ of the LSTM network, the input $x_t$ at the current time and the output $h_{t-1}$ of the network at the previous time. First, the model applies a nonlinear activation function to the neuron state $c_t$, and then applies another nonlinear activation function to $x_t$ and $h_{t-1}$. The final output of the LSTM network is determined by $o_t$ and $c_t$.
Among them, $\sigma(\cdot)$ is the nonlinear activation function, $W_o$ is the weight matrix of the output gate, and $b_o$ is the bias. The long short-term memory (LSTM) network remedies the defects of the original recurrent neural network: it largely avoids the problems of “gradient explosion” and “gradient disappearance”, and information can be transmitted over long distances within the network. Therefore, the network model is widely used in natural language processing, image processing and other fields.
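The three gates can be put together in a short sketch of one time step. It follows the commonly used formulation in which each gate is a sigmoid of an affine map of the concatenation $[h_{t-1}, x_t]$, the candidate state uses tanh, and $h_t = o_t \odot \tanh(c_t)$. This form is an assumption, since the gates here take no direct input from $c_{t-1}$, and the parameter names in the dictionary `p` are hypothetical.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(W, v, b):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps 'Wf','bf','Wi','bi','Wc','bc','Wo','bo' to parameters."""
    v = h_prev + x_t                                            # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(affine(p["Wf"], v, p["bf"]))                  # forget gate
    i_t = sigmoid(affine(p["Wi"], v, p["bi"]))                  # input gate
    g_t = [math.tanh(a) for a in affine(p["Wc"], v, p["bc"])]   # candidate state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    o_t = sigmoid(affine(p["Wo"], v, p["bo"]))                  # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy parameters: forget gate almost fully open, input gate almost closed.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {"Wf": zeros, "bf": [10.0, 10.0], "Wi": zeros, "bi": [-10.0, -10.0],
          "Wc": zeros, "bc": [0.0, 0.0], "Wo": zeros, "bo": [0.0, 0.0]}
h, c = lstm_step([1.0], [0.0, 0.0], [0.5, -0.5], params)
print(c)
```

With the forget gate near 1 and the input gate near 0, the cell state is carried forward almost unchanged, which is the mechanism that lets information travel over long distances.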
Neural network-based machine translation technology has now become the mainstream technology in the field of machine translation, and is widely used in various scenarios, such as voice translation and web page translation. In the field of neural machine translation, there are many classic model architectures, such as the seq2seq model and the Transformer model. These models all use an encoder-decoder structure based on the attention mechanism: the encoder extracts the source language features, the decoder converts the source language into the target language, and the attention mechanism passes the source language information to the decoder.
The encoder-decoder is a neural machine translation framework proposed in recent years and is also widely used in other fields of natural language processing. The encoder and decoder of this model are generally recurrent neural networks, and a bidirectional recurrent neural network can be used in the encoder. Therefore, the model can capture both the context that precedes the current word and the context that follows it.
Unlike statistical machine translation, neural machine translation uses an end-to-end approach, and all parameters are tuned jointly to obtain the best translation performance. In this structure, the source language sentence is segmented and input into the encoder, which encodes it into a fixed-length vector using a recurrent neural network. The decoder uses this vector as its initial state and combines it with the word vectors of the target language to generate the final translation. For the input $x = (x_1, x_2, \cdots, x_{T_x})$, the encoder uses a bidirectional recurrent neural network to encode source language sentences into fixed-length vectors. The recurrent neural network contains a forward network and a backward network. The forward network reads the source language data from left to right:
The backward network reads the source language data from right to left:
Among them, $E_x$ represents the word vector matrix, and
Among them, v in the above formula represents the hidden layer state of the encoder. The encoder-decoder has achieved very good results in natural language processing. However, this structure compresses the source language sequence into a vector with a fixed dimension, which makes it difficult for the encoder to process long sentences, and the source language information that the decoder can obtain when predicting the target sequence is limited. Therefore, people proposed an attention mechanism model to solve this problem.
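The bidirectional reading of the source sentence can be sketched as follows. The forward pass runs left to right, the backward pass runs right to left, and the two states at each position are concatenated. The scalar `toy_step` recurrence and the one-dimensional stand-in word vectors are assumptions made to keep the example small.

```python
import math


def rnn_states(xs, step, h0):
    """Run a recurrent step function over xs and collect every hidden state."""
    h, out = h0, []
    for x in xs:
        h = step(x, h)
        out.append(h)
    return out


def bidirectional_encode(xs, fwd_step, bwd_step, h0):
    """Concatenate forward (left-to-right) and backward (right-to-left) states per position."""
    fwd = rnn_states(xs, fwd_step, h0)
    bwd = rnn_states(xs[::-1], bwd_step, h0)[::-1]  # re-align to source order
    return [f + b for f, b in zip(fwd, bwd)]


def toy_step(x, h):
    """Toy recurrence with a one-dimensional hidden state (illustrative only)."""
    return [math.tanh(0.5 * h[0] + x[0])]


sentence = [[0.1], [0.2], [0.3]]  # stand-in word vectors
annotations = bidirectional_encode(sentence, toy_step, toy_step, [0.0])
print(annotations)
```

Each annotation therefore summarizes the words before and after its position, which is the property the attention mechanism relies on.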
Figure 5 is a schematic diagram of the attention mechanism. In neural machine translation, the attention mechanism changes the way the source language information is transmitted. When the decoder predicts the target language sequence, it assigns a weight to each word of the source language. The encoder states are then weighted by these values and summed, and the result is input to the decoder. The attention mechanism enables the decoder to make full use of the source language information when predicting the target sequence. Moreover, the more relevant a source word is to the current target word, the greater the weight it is assigned, so that the useful information of the encoder is strengthened and the useless information is weakened.

Structure diagram of attention mechanism.
As shown in Fig. 5,
Among them, score (·) is a function used to measure the similarity between
In the above formula, $a_{ts}$ is the similarity weight of the source language state
The neural machine translation without the introduction of attention mechanism does not perform well on long sentences. The main reason is that the semantic information of the sentence is completely represented by an intermediate vector, the information of the source language vocabulary cannot be transferred to the target language, and the decoder loses a lot of detailed information. The attention mechanism will enable the neural machine translation model to capture the source language information most relevant to the current vocabulary when decoding and improve the performance of the machine translation model on long sentences. At present, attention mechanisms have been widely used in neural machine translation models and other fields of natural language processing.
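The weighting step of the attention mechanism can be sketched in a few lines. The sketch scores each encoder state against the current decoder state, normalizes the scores with a softmax, and returns the weighted sum as the context vector. The dot-product score and the toy vectors are assumptions; other score functions can be passed through the `score` argument.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_states, score=dot):
    """Return (weights, context): softmax-normalised scores and the weighted sum of encoder states."""
    weights = softmax([score(decoder_state, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights, context)
```

The weights are recomputed for every target position, so the decoder is no longer limited to a single fixed-length summary of the source sentence.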
Conditional random fields (CRF) are discriminative undirected graphical models. For the input observation sequence $x = (x_1, x_2, \cdots, x_n)$ and the output label sequence $y = (y_1, y_2, \cdots, y_n)$, the conditional random field models the conditional probability $p(y \mid x)$. The components of the label sequence $y$ are related to each other, that is, the sequence is a structured variable.
Conditional random fields are widely used in many research fields of computational linguistics. For example, in the part-of-speech tagging task, $x$ represents the sentence after word segmentation, and $y$ represents the corresponding part-of-speech sequence. As shown in Table 1, $x = \{\text{Me, like, eat, apple, and, banana}\}$ is the observation sequence, and $y = \{\text{N, Vi, V, N, C, N}\}$ is the corresponding part-of-speech sequence. The meaning of the letters in the part-of-speech sequence is shown in Table 2: N, Vi, V and C represent noun, intransitive verb, verb, and conjunction respectively, and $y$ is a linear sequence structure. English morphological segmentation is similar to part-of-speech tagging and its tag sequence is also a linear structure, so a linear-chain conditional random field can be used to solve this problem. Assume that $G = \langle V, E \rangle$ is an undirected graph over the observation $x$ and the label sequence $y$, where $v \in V$ denotes a node of $G$, $y_v$ denotes the labeled variable corresponding to node $v$, and $E$ denotes the set of edges. Then the expression of conditional probability is:
$$p(y_v \mid x, y_w, w \neq v) = p(y_v \mid x, y_w, w \sim v)$$
Examples of part-of-speech tagging
The meaning of characters of part-of-speech tagging
In formula (22), $w \sim v$ means that $w$ and $v$ are adjacent in graph $G$, and $w \neq v$ denotes all nodes other than $v$. The linear-chain conditional random field is shown in Fig. 6:

Chain conditional random field.
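The definition of the conditional random field can be converted into potential functions over the cliques of the graph structure. In the linear-chain conditional random field, the cliques are the single labeled variable $\{y_i\}$ and the pair of adjacent labeled variables $\{y_{i-1}, y_i\}$. Therefore, the formula of the conditional random field can be expressed as:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{i}\sum_{k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i}\sum_{l} \mu_l\, s_l(y_i, x, i)\right)$$
Among them,
$$Z(x) = \sum_{y} \exp\left(\sum_{i}\sum_{k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i}\sum_{l} \mu_l\, s_l(y_i, x, i)\right)$$
In the above formula, $t_k(y_{i-1}, y_i, x, i)$ represents the transfer feature function, which characterizes the relationship between the label values of two adjacent positions in the sequence. $s_l(y_i, x, i)$ is the state feature function, which describes the influence of the input sequence on the labeled variables. $\lambda_k$ and $\mu_l$ are the weights of the corresponding feature functions. $Z(x)$ is the normalization factor, which ensures that the probability is correctly defined.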
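The roles of the transfer features, the state features and the normalization factor can be seen in a small sketch. It stores the weighted feature values directly in two dictionaries, `trans` for label pairs and `state` for label-word pairs, and computes $Z(x)$ by enumerating every label sequence. The toy sentence, the two-label tag set and all weights are hypothetical, and brute-force enumeration is only feasible for short sequences; practical implementations use dynamic programming instead.

```python
import math
from itertools import product


def sequence_score(y, x, trans, state):
    """Unnormalised log-score: sum of state and transfer feature weights along y."""
    s = sum(state.get((y[i], x[i]), 0.0) for i in range(len(y)))
    s += sum(trans.get((y[i - 1], y[i]), 0.0) for i in range(1, len(y)))
    return s


def crf_probability(y, x, labels, trans, state):
    """p(y|x) = exp(score(y, x)) / Z(x), with Z(x) summed over all label sequences."""
    z = sum(math.exp(sequence_score(cand, x, trans, state))
            for cand in product(labels, repeat=len(x)))
    return math.exp(sequence_score(y, x, trans, state)) / z


labels = ["N", "V"]
x = ["I", "eat", "apples"]
state = {("N", "I"): 2.0, ("V", "eat"): 2.0, ("N", "apples"): 2.0}  # hypothetical weights
trans = {("N", "V"): 1.0, ("V", "N"): 1.0}                          # hypothetical weights
p_best = crf_probability(("N", "V", "N"), x, labels, trans, state)
all_probs = {c: crf_probability(c, x, labels, trans, state)
             for c in product(labels, repeat=len(x))}
print(round(p_best, 3))
```

Because every candidate is divided by the same $Z(x)$, the probabilities of all label sequences sum to one, and the sequence favoured by both feature types receives most of the mass.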
The initial learning rate of the experiment is l, and it decreases continuously as the number of model iterations increases, which is more in line with the characteristics of neural network weight adjustment. The number of nodes in the unit is set to 512. In the experiment, the influence of different numbers of neural nodes on the neural machine translation model is also discussed. The maximum length of the sentence is 50, and the dimension of the word vector is 512.
The results of neural machine translation based on weight sharing are shown in Table 3. As can be seen from the table, compared with the baseline model of neural machine translation, the method of weight sharing can effectively improve the performance of low resource neural machine translation.
Experimental results on three sets of data sets
Figure 7 shows the curve of the error of the global weight sharing neural machine translation model and the baseline system as the number of iterations increases. The corresponding data is shown in Table 4. It can be seen from this curve that the global weight sharing neural machine translation model converges faster than the baseline system model, that is, the global weight sharing neural machine translation model reaches the point where the error value is smaller in a shorter time. However, from the final error curve trend, we can see that after the neural network parameters are continuously updated, the errors of the baseline system and the global weight sharing neural machine translation model can reach a small value. This shows that both neural machine translation models have reached the convergence state.

Error curves of the global weight sharing neural machine translation model and the baseline system.
Error data
Figure 8 shows how the BLEU values of the baseline system and the global weight sharing neural machine translation model change with the number of iterations. The corresponding data is shown in Table 5. It can be seen from this curve that at the initial moment, the BLEU values of the baseline system and the global weight sharing model are very similar. However, as the number of iteration steps increases, the performance of the global weight sharing neural machine translation model exceeds that of the baseline system. It can be seen from Figs. 7 and 8 that, although the training-set errors of the baseline system and the global weight sharing model are very similar, the global weight sharing model performs better on the test set.

BLEU value curve.
Statistical table of BLEU values
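For readers unfamiliar with how a BLEU value is obtained, the following is a minimal sentence-level sketch: clipped n-gram precisions for n = 1 to `max_n` are combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference, pre-tokenized input and no smoothing, so it is not the evaluation script used in the experiments; reported scores are normally computed at corpus level.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped matches
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))  # penalise short candidates
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
perfect = bleu(reference, reference)
repeated = bleu("the the the the the the".split(), reference, max_n=1)
print(perfect, repeated)
```

Clipping is what keeps a candidate that merely repeats a frequent word from scoring well: only as many occurrences as appear in the reference are counted.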
The algorithm of this study is compared with the translation function of traditional machine learning algorithms. The results are shown in Fig. 9 and Table 6.

Comparison diagram of recognition accuracy.
Comparison table of recognition accuracy
As can be seen from Fig. 9, the model proposed in this study achieves higher translation accuracy than the traditional model. This indicates that the improved machine learning automatic translation model can be applied to actual translation.
In view of the characteristics of English word formation, this paper uses a combination of a bidirectional long short-term memory (LSTM) network and a conditional random field to split English words into roots and affixes. In this method, the bidirectional LSTM network extracts the linguistic features of English, and the conditional random field model then predicts the output sequence. Segmenting English in this way alleviates, to a certain extent, the problem of rare words and out-of-vocabulary words in Chinese-English machine translation, and allows the network model to extract richer semantic information, thereby improving the performance of Chinese-English neural machine translation. Moreover, this study proposes a low-resource neural machine translation method based on transfer learning. This method uses parameter migration to improve the performance of Chinese-English neural machine translation. First, this study uses an encoder-decoder framework with an attention mechanism to train a high-resource neural machine translation model, and then uses some of the network parameters of that model to initialize a low-resource neural machine translation model, which provides an effective prior distribution for the low-resource model. The research results show that the model proposed in this study achieves higher translation accuracy than the traditional model, which indicates that the improved machine learning automatic translation model can be applied to actual translation.
Acknowledgment
This paper was supported by (1) the Key Project of Humanities and Social Sciences Research in Colleges and Universities 2019 of the Hebei Education Department, “The Cooperative Trinity Mechanism of PCK Dynamization for English Major Students in Normal Universities” (Number: SD192023); and (2) the Hebei Province Quality Open Online Course 2019, “Oral English”, of the Hebei Education Department.
