Abstract
With the rapid development of intelligent technology, IELTS translation education has gradually explored a brand-new education model. In order to solve the problems of small scale, slow speed and incomplete domain of the traditional bilingual parallel corpus machine translation, we construct an IELTS translation education corpus based on bilingual non-parallel data model, which can be used to train Moses, an IELTS translation education machine translation model, for better aids of translation education. In the process of construction, parallel sentence pairs are extracted from non-parallel corpus by using the translation retrieval framework represented by word graph, and a translation retrieval model based on bilingual non-parallel data is constructed. The experimental results of training Moses translation model with elementary IELTS translation corpus show that the bilingual non-parallel data model constructed in this paper has good translation retrieval performance. Compared with existing algorithms, the BLEU value extracted from parallel sentence pairs is increased by 2.58. Therefore, the proposed algorithm and corpus will do favor for the machine translation of IELTS. In addition, the retrieval method based on the structure of translation option word graph proposed in this paper is time-efficient and has better performance and efficiency in assisting IELTS translation education.
Keywords
Introduction
IELTS translation education is one of the important ways of international communications. Through IELTS translation education for people from different countries, we can solve the problem that different people in different regions understand the differences between English and their mother tongue, so as to better promote international communication [1, 2]. However, the traditional IELTS translation education only relies on teachers’ education in the classroom, which is not only inefficient, but also ineffective in translation training education. The rise of the internet and artificial intelligence technology in recent years has gradually spawned the intelligent mode of IELTS translation education [3]. In the intelligent IELTS translation education model, through natural language processing and intelligent machine translation, a machine translation based on corpus statistical model is formed to assist IELTS translation education, which is efficient and free from all kinds of restrictions [4, 5]. In fact, there are three different forms of bilingual corpus in the current machine translation model of IELTS translation education for translation training and retrieval [6]:
Parallel corpus: two different language texts with mutual translation relations are aligned in the language texts with sentence units. Parallel corpus usually obtains this type of corpus from parallel websites. Comparable corpus: two different language texts have similar content but not identical, and their alignment unit is usually a paragraph or a field. Comparable corpora generally come from the same website or website with similar bilingualism. Non-parallel corpus: the text of two different languages, and the data sparsity between the two texts is different, which makes it difficult to exist a large length of parallel sentence pairs between them. With the rapid development of the internet, non-parallel corpus gradually constructed a large number of corpus texts. There are still plenty of parallel phrases and words that can be translated into these two languages.
In IELTS translation education, the existing bilingual non-parallel data translation models are based on lexical and phrase models, which treat semantics as a collection of words and do not care about the order and structure of sentences in translation [7]. There are mainly two kinds of models based on words and phrases. One is bilingual vocabulary model, which is to obtain the translated word pairs through the characteristics of glyph and context, such as the typical relevance analysis method [8]. The other is IBM model [9], which is to solve the problem of training IBM model based on non-parallel corpus by decryption. The lexical and phrase-based model does not need to consider the structural information of sentences and the semantic relations among words, so the computational complexity is low, but the expressive ability on bilingual non-parallel corpus is poor, which cannot meet the key needs of IELTS translation education. In recent years, some researchers have learned the translation model based on continuous words from bilingual non-parallel corpus and retrieved the continuous word pairs in non-parallel corpus from known inter-translated words, and the related research shows that the known inter-translated words can construct more words from non-parallel corpus [10]. Therefore, in this paper, the parallel sentence pairs are extracted from the non-parallel corpus based on translation retrieval framework represented by word graph, and a bilingual non-parallel corpus is constructed, which enables the machine translation model in IELTS translation education to cope with various corpus and exhibit stronger expressive ability and wider coverage of translation range.
Translation retrieval model
The process of translation retrieval is to retrieve possible candidate translation sentences from a given source language sentence through the database of translation records [11]. Generally speaking, the problem of translation retrieval of monolingual data can be defined as follows: Suppose
where
where
However, in the actual process of translation retrieval, because the number of candidate translation sentences
where
where
where
The computational complexity is high and the efficiency of translation retrieval is low. It is necessary to enumerate all the translation results from n-best and query and retrieve them. Finally, the optimal retrieval results are obtained by comparison. The time complexity is The translation results produced by n-best are poorly differentiated. Although n-best produces many translation candidates, the core parts of translation are the same with some minor different translation details. Because the distinction between different translation results is not obvious, there is a larger probability of propagation of translation errors.
In order to solve the above problems in the n-best translation retrieval model, word graph structure is used to save a large number of candidate translations in n-best. These candidate translations are retrieved and sorted according to the word graph structure, so as to improve the accuracy, recall and efficiency of the machine translation retrieval model, and ultimately improve the auxiliary performance of bilingual non-peace data model in IELTS translation education.
Word graph structure is a directed acyclic graph in which there is only one start node and one end node, and each edge of the graph represents the vocabulary or the corresponding weight [14]. The word graph structure is different from the confusion network structure. Each path in the confusion network structure needs to pass through any node in the directed acyclic graph. The word graph structure has no such restriction, so the expression ability of the word graph structure is stronger. Generally speaking, word graph structure is a compressed expression method, which can represent any finite state machine and express the exponential space of sentences in the polynomial space complexity. The good compression performance of word graph structure makes it available in many fields. In this paper, we use word graph to preserve the translation results of source language sentences based on the structure of word graph structure. Subsequently, we use the set of words appearing in the word graph as the query input of the retrieval model and retrieve them in the index of the target language to be translated to produce candidate results
The schematic diagram of the word graph expression structure of the candidate translation sets.
In the reordering method of candidate translation sets, this paper introduces the commonly used method of word graph structure based on Moses search graph, and propose the word graph structure based on the improved translation option graph.
Moses-based search graph generates word graph structure [15]: The word graph structure of candidate translation sets can be output by setting corresponding decoding parameters in Moses translation model. Because the search space of this method is more than one million levels, the performance of word graph structure depends on the decoding ability of hardware and the setting of algorithm parameters. Translation option graph-based word graph structure [16]: this method deconstructs the input source language sentences into phrase tables, and then all phrase pairs in the phrase tables that can be used to translate the phrases in the source language sentences are translation options, which can be deconstructed into translation options and then to form a word graph structure. The generation method of the translation option graph is different from Moses-based search graph, whose method does not use the language model and order model of machine translation, so it does not need to decode machine translation but build phrase table. Therefore, this method is more efficient. However, there is no language model-based generation, so the accuracy of word graph structure is not as good as the graph-based search generation.
In view of the two generation methods, we define the source language sentence: “Bush and Sharon hold talks”, respectively using Moses-based search graph and translation option diagram to generate word graph structure. Figure 2 gives the results of two different ways to generate word graph.
Results of two different ways to generate word graph structure.
Figure 2a represents the generation result of word graph structure based on Moses search graph. Generally speaking,
Figure 2b displays the result of word graph structure generation based on translation option diagram. As can be seen from the structure of the graph, the method finds out the translation options corresponding to the words and phrases from the sentences of the phrase list, and constructs the word graph structure according to the order of the words appearing in the sentences. Compared with traditional methods, each side is composed of corresponding words, phrases and feature vectors. The method proposed in this paper has the following advantages in the construction of word graph structure:
There is no need for decoding search algorithm in the generation method based on translation option diagram, so it can improve the efficiency of generating word graph structure. Based on the translation option diagram, the words and phrases are reordered according to the order in which they appear, and there are no duplicate edges, so the solution search space is small. The traditional Moses-based search graph method has a huge word-graph structure, which usually needs to delete some edges by pruning algorithm, while the translation-based option graph algorithm does not need to delete the edges by pruning algorithm, so the accuracy of the knowledge space is preserved to the greatest extent.
From the comparison of such two algorithms in Fig. 2, we can see that the Moses-based search graph method has 9 nodes and 10 edges, but can only produce three translation results, and two of them are the same. However, the method based on the proposed translation option diagram has six nodes and nine edges, which form a total of 10 different translation results. Therefore, we can conclude that word graph structure generation based on translation option diagram is a more efficient compression representation method. In addition, according to the experimental study in reference [17], although the phrase model and the ordering model are not considered in the method of generating word graph structure based on translation option diagram, the two model features have little impact on the translation results. Therefore, the method of ignoring the two model features proposed in this paper is effective and reasonable.
In the translation retrieval model, we need to generate word graph structure based on translation option diagram to obtain candidate translation results. In the translation retrieval model, we use
In fact, there is still a huge search space in the search results of word structure. Therefore, we propose an iterative strategy from being coarse to fine to retrieve candidate results. Figure 3a gives an iteration strategy from being coarse to fine based on word graph structure retrieval. The input of the iteration strategy is the set of word graph structure, the candidate corpus and parameters of target language. The GetWords() function is used to obtain the lexical set appearing in the lexical graph structure and form a binary query; the function Retrieve() can input the binary query and return Top-k candidate results from the translation retrieval model; FindPath() function is used in the word graph for each candidate translation result to find the highest score path as the final score. Finally, according to the highest score of candidate translations, the retrieval results can be obtained by descending order. Figure 3b gives the detailed iteration process of function FindPath(). From the calculation process of the algorithm, we can see that the time complexity of the retrieval algorithm is
Iteration strategy from being coarse to fine based on word graph retrieval structure.
After applying the coarse-to-fine iteration strategy based on word graph structure to the translation retrieval model, we can obtain a large number of candidate translation results expressed by word graph structure, which can significantly improve the accuracy of translation retrieval. Subsequently, we apply this framework to parallel sentence pairs extraction in bilingual non-parallel corpus. As parallel sentences for bilingual non-parallel corpus extraction, we use the sentences in the source language as input units, and retrieve the word graph structure translation to get corresponding parallel sentences and the highest score of the parallel sentences. After all the sentences in the corpus get parallel sentences and scores through this way, we filter out the sentences with lower scores by the preset threshold, and obtain high-score parallel sentence pairs with good alignment effect. The threshold setting of different contexts or languages depends on the actual situation. In this paper, the scores of parallel sentence pairs are arranged in descending order, and the first k high-score sentence pairs are taken as the threshold segmentation line, so that parallel sentence pairs can be extracted from non-parallel corpus. Figure 4 gives a parallel sentence pair extraction framework based on word graph retrieval structure.
Statistical results of training, development, and test set in translation retrieval
Parallel sentence pair extraction framework based on word graph retrieval structure.
It can be seen from Fig. 4 that parallel sentence pairs extraction based on word graph retrieval structure mainly consists of three steps:
The word graph structure of candidate translation sets is generated from the source language corpus. In the experiment, we adopt the method of word graph representation based on translation option diagram. Each word in the word graph structure is used as a query item to obtain candidate parallel sentence pairs and scores in bilingual non-parallel corpus. These candidate pairs of parallel sentences are reordered by training the linear logarithmic model of sentence scores, and the sentence pairs whose score is greater than the threshold are retained. Each input retains only one of the highest score parallel sentence pairs.
Word graph translation retrieval experiments and results
In order to verify the feasibility and validity of the proposed translation retrieval algorithm based on word graph structure, we construct training set, development set and test set respectively to test the algorithm [18] by using IELTS 1998–2018 translation topic as the initial corpus. The following Table 1 shows the statistical results of training set and developing set test set. In order to test the performance of the algorithm in different data domains, two comparative experiments are set up in this paper. The training set and the test set come from the same domain for the same translation corpus test, while different translation corpus test are tested by that of different domains.
As can be seen from Table 1 above, the training set contains 1.21M Chinese-English parallel sentence pairs of IELTS translation, 32.0M Chinese vocabulary, 35.2M English vocabularies. In the experiment, we use SRILM tool [19] to construct Moses translation system through quaternary language model, develop a parallel corpus which contains 5K Chinese sentences as the query data set of word graph structure, and adopt 2.5K English sentences and other 2.23M English sentences as the retrieval input set. Because the logarithmic linear model is used to solve the translation retrieval score of word graph structure, we need to optimize the weight of each parameter to get the best result. Therefore, we set up the development set to optimize the feature weight by using the least error rate method, and the test set is used to calculate the accuracy and recall rate of translation results. In order to ensure the correct rate and recall rate have a good effect, select completely different corpus in test set and development set. In the experiment, we first train the translation retrieval model on the training set, then train the model weight on the development set, and finally count the accuracy and recall of the translation results on the test set. The accuracy rate defined in this paper is to calculate the proportion of correct answers included in the first “n” results before returning candidate results, and the recall rate is to calculate the ratio of the number of correct answers contained in the first “n” candidate results to the total number of the existing correct answers.
Comparison results of accuracy and time consumption of four comparison algorithms in the same field data
Comparison results of accuracy and time consumption of four comparison algorithms in the same field data
Comparison results of accuracy and time consumption of four comparison algorithms in different fields data
The benchmark algorithm of translation retrieval experiment based on word graph structure is the traditional translation retrieval model algorithm, and the results of machine translation with 1-best and n-best (10-best and 100-best as examples) preserved in the algorithm are compared. The method based on word graph structure includes word graph generation based on search graph and word graph generation based on translation options proposed in this paper. Tables 2 and 3 show the experimental results of the four algorithms on the same domain data and different domain data respectively.
In the experimental results, “# participants” denotes the number of translation candidates selected for each source language sentence as input for the translation retrieval query. The experimental results show that the method has a better effect on the translation corpus of IELTS teaching in different fields, while the n-best method has a better effect on the translation corpus of IELTS teaching translation in the same field. This method preserves the information of the source language sentence completely, and the error rate of translation is lower and the effect of translation retrieval is better. However, the n-best method in different domains of data due to the problem of domain migration leads to poor translation results, the results of translation retrieval dropped sharply. The translation retrieval method based on word graph structure can preserve more meaningful translation candidates, make the search scope of translation retrieval wider, and ensure that the results of translation retrieval will not be greatly reduced.
Accuracy and recall rate of IELTS education and translation corpus chart.
Figure 5 shows the curves of accuracy and recall in IELTS teaching translation corpus in the same and different fields. The experimental results show that the more candidate translation results we retain in translation retrieval, the higher the final accuracy and recall rate of translation. Compared with the n-best method, the word-graph-based translation retrieval method has improved significantly in accuracy and recall rate, and the improvement in IELTS teaching translation corpus in different fields is more significant.
In order to extract parallel sentence pairs from bilingual non-parallel data sets, we still use Moses machine translation system to retrieve parallel sentences and evaluate the quality of the extracted parallel sentence pairs. In the experiment, we use the IELTS translation corpus as the initial corpus in the translation retrieval experiment based on word graph structure, and use the BLEU evaluation index [20] to evaluate the parallel sentence pair extraction. The following Table 4 gives the statistics of the initial corpus of translation questions about IELTS old exams from 1998 to 2018.
Initial corpus statistics of the initial corpus of translation questions about IELTS old exams from 1998 to 2018
Initial corpus statistics of the initial corpus of translation questions about IELTS old exams from 1998 to 2018
In order to compare the advantages of the proposed non-parallel data model in parallel sentence pair extraction, we use the current best (state-of-the-art) Munteanu [21] parallel sentence pair extraction method as the baseline. This method first determines the document set of translation retrieval, then determines the set of translation sentence pairs, and finally extracts parallel sentence pairs from the corpus by judging the three steps of parallel sentence pairs. This method filters out non-parallel sentence pairs between bilingual cross-language retrieval document sets by heuristic rules, and then obtains sentence pairs for candidate translation retrieval. Finally, the maximum entropy model is used to judge whether each sentence pairs in the candidate pairs are parallel sentences.
Unlike the parallel sentence pair extraction method of Munteanu, which is a benchmark method, the parallel sentence pair extraction method based on word graph translation retrieval proposed in this paper generates word graph representation of candidate translations according to the sentences of the source language, and takes each word in the word graph structure as the query input of machine translation retrieval. The candidate pairs of parallel sentences are reordered by linear logarithm model to obtain the final pairs. Among them, word graph structure is constructed based on translation candidates proposed in this paper. This method is efficient in generating word graph structure and is suitable for large-scale parallel sentence pair extraction. Therefore, this method is used in parallel sentence pair extraction on Bilingual non-parallel corpus.
Comparison between baseline method and the method proposed in this paper in extracting parallel sentences from bilingual non parallel corpus
Through experiments, Table 5 shows the BLEU values for training Moses translation system from IELTS bilingual non-parallel corpus using the benchmark method and the proposed algorithm respectively. From the results of the table, we can see that the BLEU value of the parallel sentence pairs extracted by the algorithm is significantly higher than that of the benchmark method in training the translation system, which proves that the parallel sentence pairs extraction algorithm based on word graph retrieval structure proposed in this paper is effective.
In the translation retrieval experiment based on word graph structure, the experimental results of n-best traditional candidate translation methods in the same field are much better than those in different fields. The main reason is that the poor translation effect in different fields of IELTS translation corpus results in more serious error propagation. As the n of the traditional method increases from 10 to 100, the accuracy and recall rate of the n-best method are improved significantly. Different from this method, both search graph based and word graph structure based on translation candidates can preserve a large number of candidate translation results, and the word graph structure based on translation candidates proposed in this paper has not been pruned, so the results of translation candidates are more. However, because the word graph structure based on translation options does not consider the language model and the ordering model, so the features are simpler and the word graph structure is simpler. Although there are many candidate translation results, the accuracy and recall rate are not as good as the word graph structure retrieval method based on search graph. In conclusion, the two translation methods based on word graph structure are better than b-best retrieval methods. In terms of efficiency, both n-best and search-graph-based word graph structure retrieval methods have higher translation time, while the proposed method does not require translation practice, so the method is more efficient in translation. In the process of translation retrieval, the retrieval time of n-best method is linear with n, while the retrieval time of word graph representation method is linear with the number of edges of word graph. From the experimental results, we can see that the number of edges of word graph structure is not too high, and the retrieval time of 100-best is in the same order of magnitude. Therefore, the retrieval method based on the translation option graph structure proposed in this paper achieves the highest efficiency, so this method is used to extract parallel sentence pairs from bilingual non-parallel corpus.
Parallel sentence pair extraction algorithm based on word graph retrieval structure will retrieve a sentence as the translation result of the source language sentence in the target language corpus for each source language sentence, and give a score to represent the degree of parallelism of bilingual translation.
The bilingual evaluation understudy index (BLEU) is the understudy of a bilingual evaluation index. That is to say, the result of such index is used to take the place of human and evaluated the translation results. Although this index is invented for the evaluation of translation, the index can be used to evaluate a set of text generated by a natural language processing task. The BLEU index is very common in machine translation tasks in natural language processing and is used as an indicator to evaluate the difference between sentences generated by models and actual sentences. The BLEU index matches between 0.0 and 1.0. It is 1.0 if such two sentences are perfect matched, and it is 0.0 if such two sentences are perfect mismatched. As a large amount of IELTS educational basic corpus was used in the construction of bilingual corpus in this paper, during the test process of the corpus, we conducted comparative experiments under the basic corpus of 5M, 10M, 15M, and 20M, respectively (see Table 5).
In the construction of IELTS teaching translation corpus, we can select the first k Parallel results are used as parallel sentence pairs because the output order of parallel sentence pairs is arranged in descending order by parallel scores. On the basis of 5M English vocabulary, the BLEU value of the proposed algorithm is 26.31, which is 2.58 higher than that of the benchmark algorithm. The BLEU promotion of 2.58 is the average of the four cases, so the promotion is completed in the 50M corpus, that is, the average BLEU promotion obtained under the comparison of about 5*107 sentences. The improvement of BLEU can fully reflect the advantage of non-parallel corpus constructed in this paper, as well as the rationality of the bilingual non-parallel data model selected in this paper.
However, with the increase of English vocabulary, the increment of BLEU value of the benchmark system is higher, but this algorithm can maintain a certain proportion of the increase, indicating the robustness of the proposed algorithm in constructing IELTS teaching translation corpus. There are two main reasons for the faster growth of the bench marking algorithm. One is that some parallel sentence pairs can be extracted in the bench marking algorithm, but the lower scores rank behind the candidate sentences. The increase of the vocabulary of IELTS translation corpus may also increase the BLEU value of the translation system. On the other hand, the addition of incomplete parallel sentences caused by the increase of the vocabulary of IELTS translation corpus may also increase the BLEU value of the translation system.
Discussion of the proposed model and corpus
In fact, this paper constructs a non-parallel corpus based on bilingualism. Although this corpus is built on the basis of IELTS education data, it can be applied to various fields related to bilingualism, such as machine translation and natural language processing. On the basis of the corpus constructed in this paper, we can train all the steps needed in machine translation flexibly. For example, training bilingual translators in the context of IELTS education on word segmentation, grammar model, syntax model and so on. In recent years, the rapid development of deep learning has brought new vitality to machine translation. However, the training process of deep learning needs a massive corpus as the foundation, while the bilingual non-parallel corpus constructed in this paper can be used as the massive data samples needed for deep learning model to train complex deep neural network model and apply it to machine translation.
In addition, a Bilingual non-parallel data model used in non-parallel corpora is constructed in this paper, which provides a general method and process for constructing Bilingual non-parallel corpora. Through the process constructed in this paper, we can clearly see that as long as we adopt a similar process and collect a large number of related words, we can build a non-parallel corpus between any bilinguals, such as Japanese, Korean and Latin, etc. First, we can collect a large number of words, sentences and other materials in the bilingual context, then build translation retrieval model based on a large number of corpus, then acquire word graph structure of candidate translation sets, and finally build a parallel sentence pair by word graph retrieval structure. Finally, we will establish a non-parallel corpus of specific two language families as the basis for application in various fields such as machine translation, natural language processing, pattern recognition and artificial intelligence.
Conclusion
Nowadays, IELTS translation education, as one of the mainstream translation education in the world, is constantly seeking a new education model. In this paper, aiming at the problems of small scale, slow speed and incomplete fields in traditional bilingual parallel corpus machine translation, a machine translation system and a translation corpus are constructed for IELTS translation education, so as to better assist IELTS translation education. In the process of construction, for non-parallel corpus, parallel sentence pairs are extracted from the corpus by the translation retrieval framework based on lexicographic representation, and a non-parallel data model is constructed. The experimental results of training IELTS translation education model show that the bilingual non-parallel data model constructed in this paper has good translation results. Compared with the existing algorithms, the BLEU value extracted from parallel sentence pairs is increased by 2.58. In addition, the retrieval method based on the structure of translation option word graph proposed in this paper is time-efficient and has better performance and efficiency in assisting IELTS translation education.
Although the Bilingual non-parallel corpus constructed in this paper has achieved a 2.58 BLEU average improvement, the Bilingual non-parallel data model used in this paper is relatively complex, with high time complexity and space complexity, which wastes a lot of resources in the corpus construction process. In addition, although we have collected 10 years of IELTS educational corpus, the quantity of this basic corpus is still limited. In the face of complex deep neural network training, we also need more corpus as support. Corpus is not necessarily limited to IELTS education, but can build a corpus with better diversity through a wide range of bilingual vocabulary and sentences. Future works include building more targeted non-parallel data models, improving the accuracy and recall rate of parallel sentence pairs extraction, and building more targeted IELTS translation education corpus in the same and different fields.
Footnotes
Acknowledgments
The paper is supported by the fund of Foreign Language Research Project of Fujian Higher Education Institution: Research on the Construction of IELTS Teaching Resource Basis based on Autonomous Learning Mode (JZ170067).
