Issue #90 - Tangled up in BLEU: Reevaluating how we evaluate automatic metrics in Machine Translation

Dr. Karin Sim 16 Jul 2020
The topic of this blog post is evaluation.

Introduction 

Automatic metrics play a crucial role in Machine Translation (MT). They are used to tune MT systems during development, to determine which model is best, and subsequently to assess the accuracy of the final translations. The performance of these automatic metrics is currently judged by how well they correlate with human judgments of translations produced by a range of systems. WMT uses Pearson's correlation, which is highly sensitive to outliers; as a result, the correlation can appear erroneously high.
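
Pearson's r measures linear association between two score vectors, and a single system that is far better or worse than the rest can dominate it. Below is a minimal sketch (with entirely made-up DA and metric scores, using scipy) of how one outlier system can make an otherwise weak correlation look very strong:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-system scores: among the top systems the metric is
# essentially uncorrelated with the human Direct Assessment (DA) scores.
da_scores     = np.array([72.0, 71.5, 70.8, 70.2, 69.9])
metric_scores = np.array([34.1, 35.0, 33.8, 34.9, 34.3])
print(pearsonr(da_scores, metric_scores)[0])   # weak correlation

# Add one much weaker outlier system: both of its scores drop sharply,
# and the computed correlation jumps close to 1.
da_outlier     = np.append(da_scores, 40.0)
metric_outlier = np.append(metric_scores, 15.0)
print(pearsonr(da_outlier, metric_outlier)[0])
```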

As reported by Mathur et al. (2020), BLEU continues to be the industry standard despite strong evidence of its shortcomings. As this research shows, there are serious flaws in the way automatic metrics themselves are evaluated, which we briefly highlight in this post.

Findings

The most recent WMT metrics task (reported in Ma et al., 2019) found that, with a large number of systems, the correlation between the best metrics and human scores varied considerably depending on how many MT systems were under consideration. Given how fundamental the automatic metrics are, this was concerning, and it motivated the paper by Mathur et al. (2020) whose findings we report here. To judge the correctness of the metrics, Mathur et al. (2020) investigate:

  • the effect of using Pearson's correlation between Direct Assessment (DA) scores, whereby humans rate the adequacy of a translation on a scale from 0 to 100, and the scores of various automatic metrics for the MT systems,
  • how outliers affect these correlations, and
  • how reliable the metrics are for comparing two systems.

They use BLEU (Papineni et al., 2002), chrF (Popović, 2015) and TER (Snover et al., 2006) as baseline metrics (a minimal sketch of computing these follows after the list below), and find that the best metrics from the WMT 2019 task across language pairs are:

  • YiSi-1 and YiSi-2 (Lo, 2019), which compute the semantic similarity of phrases in the MT output with the reference and the source respectively (see our post a few weeks ago for a summary)
  • ESIM, which computes the similarity between sentence representations built from BERT embeddings
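
All three baselines are implemented in the sacrebleu toolkit, so a rough illustration of scoring a toy hypothesis against a reference might look as follows (assuming a recent sacrebleu version that exposes corpus_bleu, corpus_chrf and corpus_ter):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]           # MT output, one segment
references = [["the cat is sitting on the mat"]]  # list of reference streams

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter  = sacrebleu.corpus_ter(hypotheses, references)

print(f"BLEU {bleu.score:.1f}  chrF {chrf.score:.1f}  TER {ter.score:.1f}")
```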

They show that:

  1. Current metrics are unreliable when evaluating high-quality MT systems: considering 18 language pairs and only the top-N systems for each, they find that the correlation between metric and human scores decreases as N decreases, even for the best-performing language pairs (see the sketch after this list).
  2. Outlier systems are a real issue: systems of significantly better or worse quality can have a disproportionate effect on the computed correlation, to the extent that the correlation can appear high even when the metric and DA scores are otherwise uncorrelated. This leads to false confidence in the reliability of the metric. Once the outliers are removed, the difference between BLEU's correlation and that of the other metrics becomes apparent, and it is clear that the other metrics perform much better than BLEU.
  3. BLEU fails to detect differences that humans judge to be significant, which is worrying given that BLEU scores determine which model to deploy, or how significant a research finding is. TER behaves similarly. By contrast, the cases where chrF, YiSi-1 and ESIM fail to detect a difference are ones where the difference in human score is small anyway.
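
As a rough sketch of the rolling top-N analysis behind the first finding (again with entirely made-up per-system scores), the correlation can be recomputed while progressively discarding the weakest systems; the drop in r as N shrinks is the pattern the paper reports:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-system scores, sorted by human (DA) score, best first.
da     = np.array([75.0, 74.2, 73.9, 73.1, 72.5, 68.0, 61.3, 52.4])
metric = np.array([36.2, 35.1, 35.9, 34.8, 35.3, 31.0, 26.5, 20.1])

# Pearson correlation over the top-N systems only, for decreasing N.
for n in range(len(da), 3, -1):
    r, _ = pearsonr(da[:n], metric[:n])
    print(f"top-{n}: r = {r:.2f}")
```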

In summary

The work by Mathur et al. (2020) shows how current practices for assessing evaluation metrics are flawed. Beyond the problem that Pearson's correlation coefficient unduly shapes the apparent agreement between metric scores and human judgements, it is hard to judge whether the metrics become less reliable when evaluating high-quality MT systems, an issue that is growing as NMT improves. More serious is the fact that outliers may erroneously inflate the correlation, hiding the failures of BLEU. They conclude that a good way of gaining insight into metric reliability is to visualise the DA and metric scores. They also suggest that small improvements in BLEU score are potentially meaningless, and argue for dropping BLEU as the standard automatic metric in favour of chrF, YiSi or ESIM.
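
Such a visualisation can be as simple as a scatter plot of per-system metric scores against DA scores; a minimal matplotlib sketch (with made-up numbers) is shown below. Outliers and flat clusters of top systems are immediately visible in a plot even when the correlation coefficient looks reassuring.

```python
import matplotlib.pyplot as plt

# Hypothetical per-system scores; in practice these come from the WMT
# DA campaign and the metric under assessment.
metric_scores = [34.1, 35.0, 33.8, 34.9, 34.3, 15.0]
da_scores     = [72.0, 71.5, 70.8, 70.2, 69.9, 40.0]

plt.scatter(metric_scores, da_scores)
plt.xlabel("Metric score (e.g. BLEU)")
plt.ylabel("Human DA score")
plt.title("Per-system metric vs. human judgement")
plt.show()
```
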
Tags: Evaluation
Author

Dr. Karin Sim

Machine Translation Scientist
Karin is a Machine Translation Scientist who trains customer-specific machine translation engines. Her first career was as a conference interpreter, after which she retrained and worked as a software engineer for a number of years. Following a Coursera course on NLP, she undertook a PhD in Machine Translation at the University of Sheffield, focused on document-level evaluation. She subsequently moved back to industry, as she likes solving real problems in a practical way. She brings insights from her interpreter and translator background to her work training MT engines, and draws on her software engineering experience for the implementation side.