Abstract
Errors and residuals are closely related measures of deviation. An error is the deviation of the observed value (post-edited machine translation output, PEMT) from the expected value (machine translation output, MT), while the residual of the observed value is the difference between the observed and the predicted value of quality. We propose an exploratory data technique representing an ideal instrument to evaluate and improve MT systems. The main contribution is a rigorous statistical technique that is novel to MT evaluation research: residual analysis, used to identify differences between MT output and PEMT output with respect to a human translation (reference). Residual analysis of automatic metric scores can help to discover significant differences between MT and PEMT and to identify questionable issues that arise from relying on a single reference. In this study, we show how residuals can be used in MT evaluation. Using residual analysis, we identified sentences in which the scores of automatic metrics differed significantly between MT output and PEMT output translated from Slovak into English.
Introduction and related work
Due to the widespread development of computer-assisted translation (CAT) tools and machine translation (MT) systems and their extensive use in the translation process, the evaluation of MT has become increasingly important. In MT evaluation, optimisation is closely related to quality: the examined MT system is optimised based on the results of its evaluation. The quality of MT output can be assessed in two ways: manually and automatically. The manual or human approach comprises 1) traditional human assessment methods [11, 29] such as fidelity and intelligibility (used by the Automatic Language Processing Advisory Committee, ALPAC, [16]), or the later fluency, adequacy and comprehension (Advanced Research Projects Agency, ARPA, [19]), which are often evaluated at the sentence level on a scale from 1 (worst) to 5 (best); and 2) advanced methods of human assessment. The latter include ranking [32–34], which has been the official MT evaluation method of the WMT (Workshop on Statistical Machine Translation) campaigns since 2007 and which, like fluency and adequacy, is evaluated at the sentence level on a scale from 1 (worst) to 5 (best); task-oriented methods [8, 18], which are more informative than the previous ones; and post-editing [1, 38], which is performed by humans (although there is a tendency to automate it) and consists of checking and, when necessary, correcting MT output. Methods of manual MT evaluation require the knowledge of a human translator, who assesses the quality of MT output along two axes: language correctness and semantic fidelity. Human evaluation, especially error analysis, would be the most desirable, but it is intuitive or subjective, time-consuming, labour-intensive and expensive. Vilar et al. [9] pointed out that evaluator subjectivity leads to biased judgements of MT and that the numerical scales used in translation quality assessment lack a clear definition.
To overcome the issues with human evaluation, automatic evaluation metrics have been proposed and have become a common means of MT evaluation because they are fast, cheap, reusable and language-independent. They compute similarity based on matches between a set of references (fixed human translations) and the corresponding MT output. Automatic evaluation also requires human input, but it is much less time-consuming and expensive than human evaluation (texts are translated by humans only once and can be reused many times). Automatic metrics can be classified into three groups: metrics based on lexical similarity [3, 37], which measure the concordance of words within sequences, word order and edit distance; metrics based on syntactic similarity [2, 27], which take sentence structure such as part of speech into account; and metrics based on semantic similarity [5–7], which take synonyms or named entities into account. However, the results (the scores of these metrics) are only quantitative and do not identify the suspicious segments or sentences. The traditional way to evaluate the performance of automatic metrics is through ranking, i.e. human evaluators are asked to rank, sentence by sentence, the output of different MT systems (usually four MT systems, depending on the language pair). These rankings are then used to evaluate the automatic metrics by means of the Pearson correlation coefficient or Kendall’s τ [32–34]. The metrics provide only quantitative scores that indicate which MT system produces the better translation, but they do not give much detail about the types of errors that occurred in the MT output. Automatic metrics are mainly designed for the evaluation and comparison of MT systems, not for translation quality assessment.
In this study, we present another method that can be used in MT evaluation: residual analysis. We propose an exploratory data technique representing an ideal instrument to evaluate and improve MT systems. The main contribution is the use of this rigorous statistical technique, which is novel to MT evaluation research, to identify differences between MT output and post-edited machine translation output with respect to the human translation (reference).
Residuals
Researchers in the physical, economic, social and behavioural sciences, in medicine and even in translation studies (human or machine) are interested in variables that are unobservable or measured with error. An example of such a latent variable is the quality of machine translation, which is not directly observable. Moreover, language is considered a complex system of features and the relationships among them.
Errors and residuals are closely related measures of deviation. The error is the deviation of the observed value (PEMT/reference) from the expected value (MT), while the residual of the observed value is the difference between the observed and the predicted value of quality: residual = observed value − predicted value.
Residual analysis is one way to evaluate a model (here, the MT output). It helps us to assess the adequacy of the model [25]. Roger W. Hoerl [36] claimed that the analysis of residuals is an effective method for assessing the fit of the model to the data and for determining whether the model is useful. Residuals can be thought of as elements of variation unexplained by the fitted model [31]. For standard (normal) linear models, residuals are used to verify homoscedasticity, the linearity of effects, the presence of outliers, and the normality and independence of the errors [17]. According to Bollen and Arminger [21], the unstandardized residuals can be plotted to help identify residual values that are unusual in comparison to the others, while the calculated values of the residuals depend on the metric used to measure the observed variables (in our case, metrics of automatic MT evaluation such as metrics of accuracy and error rate).
Evaluation of MT output can be viewed as an evaluation of the validity of the assumptions of a statistical model. Residuals allow us to identify patterns, to better understand and interpret problems of the model, and subsequently to eliminate or correct these problems or to analyse their influence on the quality of machine translation (e.g. mistakes in machine translation). Such analysis has not been widely applied to textual data. Topp and Gomez [35] developed residual analysis for censored data (illustrated on a data set from an AIDS clinical trial study). Hildreth [23] showed the use of residual analysis in structural equation modelling (a statistical methodology used to examine causal relationships in observational data), which is used in the social and behavioural sciences because of its ability to model complex systems of human behaviour.
Experiment
The goal of this experiment was to identify sentences in which the scores of automatic metrics differed significantly between MT output and post-edited (PE) MT output translated from Slovak into English. Residual analysis of the automatic metrics can help us to discover significant differences between MT and PEMT and to identify questionable issues that arise from relying on a single reference. We calculated the automatic scores against one reference. We used our own tool to align the sentences and to compute the scores of the metrics of accuracy and error rate. We then used these metrics as variables for residual analysis at the sentence level. The MT system used was Google Translate (GT), a free web translation service and the only one offering translation from and to the low-resource language Slovak.
Dataset
We created a dataset of 360 sentences translated in one direction only: from an inflectional language (Slovak) into an analytical language (English). We chose this direction for the following reason: when we translated a text from a foreign language into our mother tongue, we obtained lower scores of the automatic metric BLEU (as our preliminary research showed). To achieve higher BLEU scores, we had to examine the opposite direction (the BLEU metric is not well suited to inflectional languages: for example, one English noun has six forms in Slovak that differ only in their suffixes, and Slovak has a looser word order than English). The 360 Slovak sentences were translated by a statistical machine translation system and subsequently post-edited by two professional translators (P) and 12 master’s students of Translation Studies (S). The reference was created by two translators and one native speaker of English who also speaks Slovak. After machine translation and post-editing, our dataset comprised 360 machine-translated sentences and 5040 post-edited sentences (360 sentences × 14 post-editors).
Evaluation methods
WER (Word Error Rate, an error-based metric) is based on edit distance and takes word order into account. It is computed from the Levenshtein distance between a hypothesis (MT output) and a reference (human translation), i.e. the minimum number of edits (insertions, deletions and substitutions) needed to transform the hypothesis into the reference.
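A minimal sketch of this computation in Python is given here for illustration. It assumes whitespace tokenisation and normalisation by the reference length, which is the usual convention for WER; the function names `levenshtein` and `wer` are hypothetical, and the sketch does not reproduce the tool used in the experiment.

```python
def levenshtein(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the token list `hyp` into the token list `ref`."""
    prev = list(range(len(ref) + 1))  # distances from an empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # distance to an empty reference: delete i words
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hypothesis, reference):
    """Word error rate: edit distance divided by the reference length
    (assumed normalisation; whitespace tokenisation is also an assumption)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return levenshtein(hyp, ref) / len(ref)


print(wer("the cat sat", "the cat sat on the mat"))  # 3 insertions over 6 reference words
```

Because the number of edits is divided by the reference length, WER can exceed 1 when the hypothesis is much longer than the reference.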
PER (Position-independent Error Rate) is similar to WER, but it does not take word order into account. It treats the reference and the hypothesis as bags of words and counts the number of times that identical words appear in both sentences (MT and reference).
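The bag-of-words idea can be sketched as follows. The sketch assumes one common formulation of PER, namely 1 − (matches − max(0, |hypothesis| − |reference|)) / |reference|, together with whitespace tokenisation; the exact formula used in the experiment is not shown in the text, and the function name `per` is hypothetical.

```python
from collections import Counter


def per(hypothesis, reference):
    """Position-independent error rate under an assumed common formulation:
    1 - (matches - surplus) / len(reference), where surplus penalises a
    hypothesis that is longer than the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection: each word counts as often as it occurs in both.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    surplus = max(0, len(hyp) - len(ref))
    return 1.0 - (matches - surplus) / len(ref)


# Word order is ignored, so a reordered sentence has no position-independent errors.
print(per("sat cat the", "the cat sat"))
```

Comparing `per` with a word-order-sensitive metric on the same sentence pair shows how much of the error is due to reordering alone.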
CDER (Cover Disjoint Error Rate) is a recall-oriented measure based on the Levenshtein distance. It uses the fact that the number of blocks in a sentence equals the number of gaps between them plus one. It requires both the hypothesis and the reference to be covered completely and disjointly. Only the words in the reference must be covered exactly once; the words in the hypothesis can be covered zero, one or more times.
MT can be evaluated using the precision, recall and f-measure metrics, which are well known and widely used in natural language processing. They are based on lexical similarity, i.e. on correct word matches between the MT output (hypothesis) and the reference. Precision is the proportion of words in the MT output that are also present in the reference.
Recall is the proportion of words in the reference that are also present in the MT output.
F-measure is the harmonic mean of precision and recall.
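The three definitions above can be illustrated with a short Python sketch. It assumes whitespace tokenisation and counts each word at most as often as it occurs in both sentences (clipped bag-of-words matching); these choices and the function name `precision_recall_f` are assumptions made for illustration only.

```python
from collections import Counter


def precision_recall_f(hypothesis, reference):
    """Word-level precision, recall and F-measure of a hypothesis (MT output)
    against a single reference, using clipped bag-of-words matches."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())
    precision = matches / len(hyp) if hyp else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


p, r, f = precision_recall_f("the cat sat", "the cat sat on the mat")
print(p, r, f)  # every hypothesis word is matched, but only half of the reference is covered
```

A short hypothesis that contains only correct words therefore scores high on precision and low on recall, which is why the two are combined in the F-measure.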
BLEU (Bilingual Evaluation Understudy, [20]) is the geometric mean of n-gram precisions (for n-grams of size 1–4) multiplied by a brevity penalty (BP), i.e. a length-based penalty that prevents very short hypotheses from compensating for an inappropriate translation.
In this notation, S denotes the hypothesis (h) and r the reference in the complete corpus C.
BLEU reflects two aspects of translation quality, adequacy and fluency, by calculating word or lexical precision.
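As an illustration, the following Python sketch computes a sentence-level BLEU-n score against one reference. It follows the standard definition (clipped n-gram precisions, uniform weights, brevity penalty equal to 1 if the hypothesis is longer than the reference and exp(1 − r/c) otherwise) and applies no smoothing, so a sentence without any matching n-gram of some order scores 0. Whitespace tokenisation, the absence of smoothing and the function names are assumptions; the sketch is not the tool used in the experiment.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams of size n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with n-grams of size 1..max_n."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum((hyp_counts & ref_counts).values())
        if clipped == 0:  # also covers hypotheses shorter than n words
            return 0.0
        log_precisions.append(math.log(clipped / sum(hyp_counts.values())))
    # Brevity penalty: c = hypothesis length, r = reference length.
    c, r = len(hyp), len(ref)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# A hypothesis half as long as the reference is penalised even if all its words match.
print(sentence_bleu("a b", "a b c d", max_n=1))
```

Calling `sentence_bleu` with `max_n` set to 1, 2, 3 or 4 corresponds to the BLEU-1 to BLEU-4 variants discussed in the results.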
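We used residuals to compare the scores of the automatic metrics of PEMT with those of MT at the sentence level. In our case, the residual analysis is defined as follows: the residual value of sentence i is the score of the automatic metric for the PEMT of sentence i minus the score of the same metric for the MT of sentence i.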
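We used the ±2σ rule to identify extreme values: a residual is considered extreme if it lies outside the interval given by the mean residual ± 2 standard deviations.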
By aggregating the scores of the automatic metrics using a weighted average, we created one variable for each metric. Residual values above the residual mean indicate an above-average assessment of PEMT by the automatic metric in comparison to the MT output. Conversely, residual values below the residual mean indicate a below-average assessment of PEMT output by the automatic metric in comparison to the MT output. Identifying extreme values helps us to detect sentences in which significant differences between MT and PEMT were found (translation from Slovak into English).
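The residual comparison and the ±2σ rule can be sketched in Python as follows. The sketch assumes that the residual of a sentence is its PEMT score minus its MT score, that the PEMT scores have already been aggregated into one value per sentence, and that σ is the population standard deviation of the residuals; the function names and the toy scores are hypothetical.

```python
from statistics import mean, pstdev


def residuals(pemt_scores, mt_scores):
    """Sentence-level residuals, assumed here to be PEMT score minus MT score."""
    if len(pemt_scores) != len(mt_scores):
        raise ValueError("both score lists must cover the same sentences")
    return [p - m for p, m in zip(pemt_scores, mt_scores)]


def extreme_sentences(res, k=2.0):
    """Return (sentence number, residual) pairs outside mean ± k·σ.
    Sentence numbers start at 1; σ is the population standard deviation
    (an assumption, the sample standard deviation is an equally valid choice)."""
    mu, sigma = mean(res), pstdev(res)
    lower, upper = mu - k * sigma, mu + k * sigma
    return [(i, r) for i, r in enumerate(res, start=1) if r < lower or r > upper]


# Toy scores: only sentence 10 differs between PEMT and MT.
pemt = [0.5] * 9 + [1.0]
mt = [0.5] * 10
print(extreme_sentences(residuals(pemt, mt)))  # [(10, 0.5)]
```

For accuracy metrics, a flagged residual above the upper bound points to a sentence where PEMT is much closer to the reference than MT; for error-rate metrics, the interpretation is reversed.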
Results
Given the number of examined sentences (360), we present results only for the first 37 sentences. The graphs (Figs. 1–3) depict the precision, recall and f-measure scores of the PEMT and MT output translated from Slovak into English, together with their residual values.

Fig. 1. Results of residual analysis of the precision metric of MT and PEMT.
Some sentences translated by the MT system, such as sentences 3, 11 and 14 (green colour), are closer to the reference than their PEMT counterparts. On the other hand, PEMT sentences such as 9 and 29 achieved higher precision scores than the corresponding MT sentences.
Sentences 2, 29 and 31 (Figs. 1–3) were assessed above average by the precision, recall and f-measure metrics with regard to the accuracy of PEMT. Sentence 35 (Figs. 2 and 3) was also assessed above average by the recall and f-measure metrics. These PEMT sentences were the closest to the reference translation. By contrast, sentence 14 (Figs. 1–3) was assessed below average by the precision, recall and f-measure metrics with regard to the accuracy of PEMT against MT. Similarly, sentence 11 (Fig. 1), based on precision, and sentence 18 (Figs. 2 and 3), based on recall and f-measure, were also assessed below average with regard to the accuracy of PEMT.

Fig. 2. Results of residual analysis of the recall metric of MT and PEMT.
A significant difference between PEMT and MT was found for sentence 29 (Figs. 1 and 3) in precision and f-measure, in favour of PEMT, and for sentence 14 (Figs. 2 and 3) in recall and f-measure, in favour of MT.
The graphs (Figs. 4–6) depict the scores of the automatic metrics of error rate (PER, WER and CDER) of the PEMT and MT output translated from Slovak into English, together with their residual values. Residual values above the residual mean indicate an above-average error rate of PEMT according to the automatic metric in comparison to the MT output. Conversely, residual values below the residual mean indicate a below-average error rate of PEMT output according to the automatic metric in comparison to the MT output. Identifying extreme values helps us to detect sentences in which significant differences in the error rate between MT and PEMT were found (translation from Slovak into English).

Fig. 3. Results of residual analysis of the f-measure metric of MT and PEMT.

Fig. 4. Results of residual analysis of the PER metric of MT and PEMT.
Sentence 14 (Figs. 4–6) had an above-average error rate according to the PER, WER and CDER metrics. Sentence 18 (Fig. 4), based on PER, and sentence 35 (Fig. 5), based on WER, also had an above-average error rate. On the other hand, the PEMT of sentences 2 and 29 (Figs. 4–6) had a below-average error rate according to the PER, WER and CDER metrics. Likewise, based on PER, the PEMT of sentences 31 and 35 (Fig. 4) had a below-average error rate.

Fig. 5. Results of residual analysis of the WER metric of MT and PEMT.

Fig. 6. Results of residual analysis of the CDER metric of MT and PEMT.

Fig. 7. Values of the BLEU-1 metric of PEMT and MT English output.

Fig. 8. Results of residual analysis of the BLEU-1 metric of MT and PEMT.

Fig. 9. Values of the BLEU-2 metric of PEMT and MT English output.
A significant difference was found in the case of sentence 14 (Figs. 4 and 6) in the PER and CDER, and in the case of sentence 35 (Fig. 5) in the value of the WER metric between the PEMT and MT in favour of MT output. On the other hand, a significant difference was found in the values of WER and CDER for sentence 29 (Figs. 5 and 6), and for sentence 2 (Fig. 5) in WER between PEMT and MT in favour of PEMT output.
The graphs in Figs. 7, 9, 11 and 13 show the values of the automatic metric of accuracy BLEU-n (n = 1 to 4) for the PEMT and MT output. The graphs in Figs. 8, 10, 12 and 14 depict the residual values of the BLEU metric. As with precision, recall and f-measure, residual values above the residual mean indicate an above-average accuracy of PEMT according to the automatic metric in comparison to the MT output. Conversely, residual values below the residual mean indicate a below-average accuracy of PEMT output according to the automatic metric in comparison to the MT output. Identifying extreme values helps us to detect sentences in which significant differences in accuracy between MT and PEMT were found.

Fig. 10. Results of residual analysis of the BLEU-2 metric of MT and PEMT.

Fig. 11. Values of the BLEU-3 metric of PEMT and MT English output.

Fig. 12. Results of residual analysis of the BLEU-3 metric of MT and PEMT.

Fig. 13. Values of the BLEU-4 metric of PEMT and MT English output.
The PEMT of sentence 29 (Figs. 8, 10, 12 and 14) was assessed above average by BLEU-n (n = 1 to 4). In addition, based on the BLEU-1 score, the PEMT of sentences 31 and 35 (Fig. 8) was assessed above average with regard to accuracy. Based on BLEU-2, the PEMT of sentence 10 (Fig. 10) was assessed above average with regard to accuracy. Similarly, the PEMT of sentence 25 (Figs. 12 and 14) was assessed above average by the BLEU-3 and BLEU-4 metrics. By contrast, based on the BLEU-n scores (n = 1 to 4), the PEMT of sentences 14 and 18 was assessed below average in terms of accuracy (Figs. 8, 10, 12 and 14). These sentences were more similar to the reference when translated by machine than when post-edited. Likewise, the PEMT of sentence 19 (Fig. 14) was assessed below average with respect to the BLEU-4 score.

Fig. 14. Results of residual analysis of the BLEU-4 metric of MT and PEMT.
A significant difference was found in sentence 29 in the BLEU-n scores (n = 1 to 4) between PEMT and MT in favour of PEMT output (Figs. 8, 10, 12 and 14). On the other hand, a significant difference was found in sentence 14 (Figs. 8, 10, 12 and 14) in the BLEU-n scores (n = 1 to 4) between MT and PEMT in favour of MT output. In other words, the MT was more similar to the reference than the PEMT.
The aim of the study was to examine two variables, post-edited machine translation (PEMT) output and machine translation (MT) output, through their residuals and with the use of a reference. Based on the residual analysis of the automatic evaluation (using metrics of automatic MT evaluation) of PEMT and MT, we found significant differences in the scores of automatic MT metrics between PEMT and MT at the sentence level (Figs. 1–6, 8, 10, 12 and 14). A significant difference was found in sentence 29 in favour of PEMT and, conversely, in sentence 14 in favour of MT (in terms of the recall and precision of the translation).
Sentence 29:
ST: Strojový preklad sa často používa ako:
MT: Machine translation is often used as:
Ref.: Machine translation is often used as:
PEMT_S1: Machine translation is often used as:
PEMT_P1: Machine translation is often used as:
Sentence 14:
ST: Ročné náklady EÚ na preklady dokumentov pre jednotlivé krajiny v roku 2007 predstavovali približne 400 miliónov eur.
MT: The annual cost to the EU of translations of documents for each country in 2007 amounted to approximately € 400 million.
Ref.: The EU’s annual document translation cost for each country in 2007 amounted to approximately € 400 million.
PEMT_S2: The annual cost to the EU for translations of documents for each country in 2007 amounted to approximately € 400 million.
PEMT_S9: The annual cost of translations of documents for the EU for each country in 2007 amounted to approximately 400 million €.
PEMT_P1: The annual EU cost of the translated documents for each country was approximately € 400 million in 2007.
In terms of accuracy and word coverage, a significant difference was found in sentence 29 in favour of PEMT and in sentence 14 in favour of MT. Sentence 29 is very specific: it was correctly translated by the MT system, but the sentence was too short (consisting of six words), which resulted in a penalty for shortness. MT sentence 14 is covered by more words than reference 14. Additionally, regardless of syntax, the MT sentence shares more words with the PEMT sentences than with the reference.
In terms of error rate, a significant difference was found in sentences 29 and 2 in favour of PEMT.
Sentence 2:
ST: Strojový preklad (Machine Translation, ďalej len MT) sa stal v poslednom desaťročí významným predmetom výskumu nielen v akademickej sfére, ale aj v komerčnej sfére.
MT: Machine translation (Machine Translation, hereinafter MT) has become in the last decade significant research subject not only in academia, but also in the commercial sphere.
Ref.: Over the last decade, machine translation (MT) has become an important area of study and research, for both commercial and academic purposes.
PEMT_S12: In the last decade machine translation (Machine Translation, hereinafter MT) has become a significant subject of research not only in academia, but also in the commercial area.
PEMT_S10: In the last decade machine translation (hereinafter MT) has become a significant subject of research not only in academic sphere, but also in the commercial sphere.
PEMT_P1: In the last decade machine translation (Machine Translation, hereinafter MT) has become significant research subject not only in academic, but also in the commercial sphere.
In the MT of sentence 2, the sentence structure is sound and the vocabulary is known, so only minimal edit operations were needed to transform the MT sentence into a correct one.
Based on the residuals, we can restrict the analysis to the sentences with extreme values and then identify the major MT errors. A closer look at the extreme sentences reported for the whole dataset allowed us to identify several relevant MT errors. In our case, the most frequent MT errors were syntactic (word order and agent), lexical (omitted words, mistranslations and synonyms), morphological (articles and passive voice) and punctuation errors.
Residual analysis is very useful when we do not want to carry out an error analysis of the whole MT output, which is not only time-consuming and expensive but often also disputable. Our approach to the evaluation of MT systems (using automatic MT metrics) is original and represents an ideal instrument to evaluate and improve MT systems. In addition, the identification of extremes (using residuals) can help us to detect major MT errors.
Acknowledgments
This work was supported by the Slovak Research and Development Agency under the contract No. APVV-14-0336.
