Over the years, BLEU has become the “de facto standard” for automatic Machine Translation evaluation. However, despite being the metric referenced in all MT research papers, it is just as often criticized for not providing a reliable evaluation of MT output. In today’s blog post we look at the work done by Freitag et al. (2020), who investigate to what extent BLEU (Papineni et al., 2002) is to blame, as opposed to the references used to evaluate MT output.
In our very first blog post on evaluation (#8), Dr. Sheila Castilho was already questioning the quality of the data we use. She asked whether MT evaluation results can be trusted if the quality of the data sets is very poor. In issues #80 and #81 we also reviewed a set of recommendations made recently by Läubli et al. (2020) for performing MT evaluations aimed at assessing Human Parity in MT. Finally, in issues #87 and #90 we explored, respectively, the semantic MT quality evaluation metric YiSi proposed by Lo (2019) and an assessment of current evaluation metrics done by Mathur et al. (2020).
Freitag et al. (2020) carry out a study to assess the importance of references in evaluating Machine Translation output. Additionally, they assess the correlation between different references and human judgements. They validate their proposed approach not only against BLEU, but also against other modern evaluation metrics, including embedding-based methods like YiSi (if you are curious about this relatively new MT evaluation metric, check out our recent #87 blog post!).
A bit of context
Before we delve into the details of the paper, it is worth briefly explaining a concept that scholars often refer to when analysing translations: translationese (Baker et al., 1993). When a text is translated, the translation usually exhibits a set of characteristics that differentiate it from a text originally written in the target language (not translated). More concretely, it has been observed that translations, when compared to texts originally written in the target language, are typically simpler, show less lexical variation, and often resort to mechanisms such as explicitation. On top of that, the influence of the source text on translations has also been researched, and scholars like Toury (2012) have found evidence of it. An example of such influence would be a sentence structure that closely mirrors that of the source text, instead of a more natural sentence structure in the target language. This is because translators usually follow the source text and translate it in the same order as it is presented.
Research questions
Freitag et al. (2020) take as a starting point the findings around translationese to research how different mechanisms to generate translation references could affect the evaluation of MT output. More specifically, they aim to answer the following questions:
- How can new, reliable, translation references be generated? In particular, they are interested in references that yield a positive correlation with human judgements in MT evaluation.
- Can paraphrased translations be used as alternative references for MT evaluation? They only explore paraphrases created by linguists, and not automatic approaches to generate paraphrases.
- Which type of reference among the ones generated leads automatic evaluation metrics to correlate best with human judgements?
The experiments
They set up an initial experiment whereby they generated three alternative references for the WMT 2019 English-to-German news translation task:
- A new translation from scratch of the source text;
- A paraphrased version of the official reference used in the shared task; and
- A paraphrased version of the new reference they created translating from scratch.
All new references were created by linguists who, in the case of paraphrases, were instructed to paraphrase the texts as much as possible. The authors then validated all references, including the original one from the shared task. To do so, they asked linguists to assess their adequacy (how much of the original meaning is preserved?) against the original text. Interestingly, the new translation they generated from scratch obtained a higher adequacy score than the original one, and the paraphrased versions scored lower. However, when they combined the different references, keeping the higher-scoring translation in each case, the combination of the paraphrased references seemed to be better than the translation from scratch alone. What it did not manage to beat, though, was the combination of the two translated references, which yielded better results. Unsurprisingly, the best scores were obtained when all four references were combined. The authors acknowledge that further research is needed, but point out the possibility of translationese interfering with the adequacy assessment: if a reference had been paraphrased and no longer followed the source text structure, for instance, this may have caused the evaluators to rate it lower.
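The combination step described above can be pictured with a minimal sketch. It assumes that every reference has one adequacy score per segment and that "combining" means keeping, for each segment, the version that was rated highest. The function name `combine_by_adequacy`, the reference labels and the scores are hypothetical illustrations, not the authors' code or data.

```python
from typing import Dict, List


def combine_by_adequacy(
    references: Dict[str, List[str]],
    adequacy: Dict[str, List[float]],
) -> List[str]:
    """Build one reference by picking, per segment, the highest-rated version.

    references: reference name -> list of segments (all lists the same length)
    adequacy:   reference name -> list of adequacy scores, aligned by segment
    Ties are resolved in favour of the reference listed first.
    """
    names = list(references)
    n_segments = len(references[names[0]])
    if any(len(references[n]) != n_segments or len(adequacy[n]) != n_segments
           for n in names):
        raise ValueError("all references and score lists must be aligned")

    combined = []
    for i in range(n_segments):
        # max() returns the first maximal element, so earlier names win ties
        best = max(names, key=lambda name: adequacy[name][i])
        combined.append(references[best][i])
    return combined


# Toy data: two references of three segments each, with made-up scores
refs = {"wmt": ["a1", "a2", "a3"], "new": ["b1", "b2", "b3"]}
scores = {"wmt": [4.0, 3.0, 5.0], "new": [5.0, 3.0, 2.0]}
print(combine_by_adequacy(refs, scores))  # ['b1', 'a2', 'a3']
```

The result is still a single reference per segment, which is different from giving a metric several references at once.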
As a next step, the authors compared the rank correlations of BLEU when evaluating the 22 submissions to the WMT 2019 news translation shared task (Barrault et al., 2019) with each of the references (both as single references and combined into one single reference). In all cases, the highest correlation was achieved when using the combined references individually, as well as the paraphrased WMT official reference. Contrary to the common belief that computing BLEU using multiple references yields better correlations with human judgements, the authors show that this was not the case in their experiment. In fact, the best correlations were obtained when the best segments of each reference were selected and used as a single reference. Similar observations were made when assessing the alternative evaluation metrics TER, chrF, METEOR, BERTScore, and YiSi-1. In particular, YiSi seemed to benefit most from paraphrases.
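To make "rank correlation" concrete, here is a minimal sketch of how system-level metric scores can be compared with human scores. It assumes Kendall's tau (the simple tau-a variant, which ignores tied pairs) as the coefficient; the post does not say which coefficient the authors used, and all numbers below are invented toy values, not results from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two aligned sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both scores order the two systems the same way
        elif product < 0:
            discordant += 1      # the two scores disagree on the ordering
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy example: four hypothetical systems, one human score each,
# and BLEU computed against two different hypothetical references.
human = [0.9, 0.7, 0.4, 0.1]
bleu_standard_ref = [40.0, 42.0, 35.0, 30.0]   # swaps the top two systems
bleu_paraphrased_ref = [12.0, 11.0, 9.0, 8.0]  # lower scores, same ranking

print(round(kendall_tau(human, bleu_standard_ref), 3))     # 0.667
print(round(kendall_tau(human, bleu_paraphrased_ref), 3))  # 1.0
```

The toy numbers also show that absolute BLEU values and correlation are separate things: the second reference gives much lower scores but ranks the systems exactly as the human scores do.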
Their paraphrased-reference approach showed particularly good correlation with the top submissions of WMT. It also correctly distinguished baselines from systems known to produce translation outputs considered more natural, such as those augmented with either back-translation or automatic post-editing. These systems tend to be penalised more by automatic metrics, as their output tends to be more distant from the reference translations. The authors claim that this is due to the translationese artifacts present in translations used as references.
Before concluding, the authors also tested the different references to identify their degrees of translationese. More concretely, and following up on the work by Toral (2019), they test the following three translationese artifacts:
- Lexical variety: the lower the number of unique words, the higher the level of translationese
- Lexical density: the lower the percentage of content words, the higher the level of translationese
- Length variety: the lower the variance between the source and the target sentence length, the higher the level of translationese
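The three measures above can be approximated with a few lines of code. The sketch below rests on several assumptions: lexical variety is taken as a type-token ratio, lexical density is estimated with a tiny hand-written English function-word list instead of a part-of-speech tagger, and length variety is taken as the mean normalised absolute difference between source and target length in tokens. None of these exact formulas is stated in the post, and the function names are hypothetical.

```python
import re
from typing import List

# Tiny illustrative English function-word list (assumption). A real analysis
# would use a POS tagger for the language of the references.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "or",
                  "is", "was", "that", "it"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def lexical_variety(segments: List[str]) -> float:
    """Type-token ratio: unique words divided by total words."""
    tokens = [t for s in segments for t in tokenize(s)]
    return len(set(tokens)) / len(tokens)


def lexical_density(segments: List[str]) -> float:
    """Share of tokens that are not in the function-word list."""
    tokens = [t for s in segments for t in tokenize(s)]
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


def length_variety(sources: List[str], targets: List[str]) -> float:
    """Mean of |len(source) - len(target)| / len(source), in tokens."""
    diffs = []
    for src, tgt in zip(sources, targets):
        src_len, tgt_len = len(tokenize(src)), len(tokenize(tgt))
        diffs.append(abs(src_len - tgt_len) / src_len)
    return sum(diffs) / len(diffs)


sample = ["the cat sat on the mat"]
print(round(lexical_variety(sample), 3))       # 0.833 (5 unique / 6 tokens)
print(lexical_density(sample))                 # 0.5   (cat, sat, mat / 6)
print(length_variety(["a b c d"], ["a b c"]))  # 0.25
```

Under these definitions, lower values on all three measures would point towards more translationese. The functions assume non-empty input segments.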