Issue #90 - Tangled up in BLEU: Reevaluating how we evaluate automatic metrics in Machine Translation
Introduction
Automatic metrics play a crucial role in Machine Translation (MT). They are used to tune MT systems during development, to decide which model is best, and ultimately to assess the quality of the final translations. Currently, the performance of these automatic metrics is judged by how well they correlate with human judgments of translations produced by various systems. WMT currently uses Pearson's correlation, which is highly sensitive to outliers; as a result, the correlation can appear erroneously high.
As reported by Mathur et al. (2020), despite strong evidence of its shortcomings, BLEU remains the industry standard. As this research indicates, there are serious flaws in the way automatic metrics are evaluated, which we briefly highlight in this post.
Findings
The most recent WMT metrics task (reported in Ma et al., 2019) found that, with a large number of systems, the correlation between the best metrics and human scores varied considerably depending on how many MT systems were under consideration. Given how fundamental automatic metrics are, this was concerning, and it motivated the paper by Mathur et al. (2020) whose findings we report here. To judge the correctness of the metrics, Mathur et al. (2020) investigate:
- the Pearson correlation between Direct Assessment (DA) scores, whereby humans rate the adequacy of a translation on a scale from 0 to 100, and various automatic metrics computed over the MT systems (see the sketch after this list),
- how outliers affect these correlations, and
- how reliable the metrics are for comparing two systems.
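To make the first point concrete, system-level correlation is computed between one average DA score per system and one metric score per system. Below is a minimal sketch of that computation using SciPy; the scores are made-up illustrative numbers, not WMT data.

```python
# Minimal sketch of system-level Pearson correlation between human DA scores
# and an automatic metric. All numbers are hypothetical, for illustration only.
from scipy.stats import pearsonr

# One score per MT system: average Direct Assessment (0-100) and a metric score.
da_scores     = [68.2, 71.5, 74.9, 75.3, 77.1, 80.4]   # hypothetical DA averages
metric_scores = [24.1, 27.3, 28.0, 29.6, 30.2, 33.8]   # hypothetical BLEU scores

r, p_value = pearsonr(da_scores, metric_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```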
They use the following as baseline metrics: BLEU (Papineni et al., 2002), chrF (Popović, 2015), and TER (Snover et al., 2006); a scoring sketch for these baselines follows the list below. They find that the best metrics from the WMT 2019 task across language pairs are:
- YiSi-1 and YiSi-2 (Lo, 2019), which compute the semantic similarity of phrases in the MT output with the reference and the source, respectively (see our post from a few weeks ago for a summary)
- ESIM, which computes the similarity between sentence representations built from BERT embeddings
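As promised above, here is a minimal sketch of how the three baseline metrics can be computed for one system's outputs using the sacrebleu toolkit. This is an illustrative assumption on our part rather than the paper's exact scoring setup, and the sentences are toy examples.

```python
# Sketch: corpus-level BLEU, chrF and TER for one system, via sacrebleu.
# Toy hypotheses/references; real evaluation would use full test sets.
import sacrebleu

hypotheses = ["the cat sat on the mat .", "he went to the market today ."]
references = ["the cat sat on the mat .", "he went to the market this morning ."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter  = sacrebleu.corpus_ter(hypotheses, [references])

print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}  TER: {ter.score:.1f}")
```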
They show that:
- Current metrics are unreliable when evaluating high-quality MT systems: considering only the top-N systems across 18 language pairs, the correlation between metric and human scores decreases as N decreases, even for the language pairs where the metrics perform best (the first sketch after this list illustrates both this top-N analysis and the outlier removal described next).
- Outlier systems are a real issue: systems that are significantly better or worse than the rest can have a disproportionate effect on the computed correlation, to the extent that the correlation can appear high even when there is otherwise no relationship between the metric and the DA scores. This leads to false confidence in the reliability of the metric. When the outliers are removed, the gap between BLEU's correlation and that of the other metrics becomes apparent, and it is clear that the other metrics perform much better than BLEU.
- BLEU fails to detect what humans judge to be significant differences, which is worrying given that BLEU scores determine which model to deploy or how significant a research finding is; TER behaves similarly. In contrast, the cases where chrF, YiSi-1 and ESIM fail to detect a difference are ones where the difference in human score is small anyway (see the paired bootstrap sketch below for how such pairwise comparisons can be tested).
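To illustrate the first two findings, the sketch below recomputes Pearson's r over only the top-N systems and after removing outlier systems with a robust rule based on the median absolute deviation. The 1.483 scaling and 2.5 cutoff are the standard robust z-score values rather than the paper's verified settings, and all scores are made up for illustration.

```python
# Sketch: correlation over top-N systems, and after MAD-based outlier removal.
# Hypothetical scores; the point is the mechanism, not the particular numbers.
import numpy as np
from scipy.stats import pearsonr

da     = np.array([55.0, 67.8, 70.1, 71.4, 72.9, 74.2, 75.0, 76.3])  # hypothetical DA
metric = np.array([15.2, 26.9, 27.5, 28.4, 28.1, 29.3, 30.0, 30.8])  # hypothetical metric

# Correlation over all systems vs. over only the top-N by human score.
order = np.argsort(da)[::-1]
for n in (len(da), 4):
    idx = order[:n]
    r, _ = pearsonr(da[idx], metric[idx])
    print(f"top-{n}: r = {r:.3f}")

# Robust outlier removal: drop systems whose DA score is far from the median,
# measured in units of the scaled median absolute deviation.
mad = 1.483 * np.median(np.abs(da - np.median(da)))
keep = np.abs(da - np.median(da)) / mad <= 2.5
r_all, _  = pearsonr(da, metric)
r_kept, _ = pearsonr(da[keep], metric[keep])
print(f"all systems: r = {r_all:.3f}   outliers removed: r = {r_kept:.3f}")
```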
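Finally, on pairwise comparisons: a standard way to test whether one system's BLEU advantage over another is significant is paired bootstrap resampling (Koehn, 2004). The sketch below shows that procedure on toy data; it illustrates the kind of significance test referred to above, not necessarily the paper's exact methodology.

```python
# Sketch: paired bootstrap resampling to estimate how often system A's BLEU
# beats system B's on resampled test sets. Toy data for illustration only.
import random
import sacrebleu

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A beats B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample sentences with replacement
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        bleu_a = sacrebleu.corpus_bleu(sample_a, [sample_r]).score
        bleu_b = sacrebleu.corpus_bleu(sample_b, [sample_r]).score
        wins_a += bleu_a > bleu_b
    return wins_a / n_samples

# Toy data; in practice these would be full test-set translations.
refs  = ["the cat sat on the mat .", "she read the report yesterday ."]
sys_a = ["the cat sat on the mat .", "she read the report yesterday ."]
sys_b = ["a cat sat on a mat .",     "she reads the report ."]
print(f"P(A beats B) = {paired_bootstrap(sys_a, sys_b, refs):.2f}")
```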