Evaluating the quality of machine translation is crucial to improving its output. But what are the best metrics of MT quality and what purposes do they serve?
There are two broad types: human and automated evaluations. While comprehensive human evaluation is often the most effective solution, it’s also subjective, time-consuming and costly. That’s why researchers in industry and academia introduced standard, automated metrics we can use at scale to measure how well MT performs. Many studies have shown that these metrics correlate well with human evaluation results.
And with Neural MT (NMT), the demand for data-driven ways to quantify MT quality has grown. NMT has significantly different output characteristics than Statistical MT (SMT), so researchers are looking at producing new metrics that can more reliably evaluate NMT output.
To that end, we brought in our Senior Solutions Architect Miklós Urbán to enlighten us on the current state of affairs.
Let’s start with an overview of the automated metrics used today. Which do we use at RWS Moravia?
There are a number of them, but we primarily use two. One is the BLEU score, the industry’s first-ever commonly used metric, which works by comparing existing translations. Let’s imagine you have a sample source text translated twice: once by humans and once by MT. The BLEU score is the proportion of words that appear in the MT output which also exist in the human translation (the “golden reference”).
When BLEU became widespread 10 to 15 years ago, everybody accepted it as most in line with how humans would evaluate translations. It is still widely used despite having well-known limitations. For example, it doesn’t deal well with synonyms or grammatical word changes, and it’s also one-sided because it compares in only one direction: MT to the human reference.
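As a rough illustration of the one-directional comparison described above, here is a minimal Python sketch of unigram precision. Real BLEU combines clipped n-gram precisions (up to 4-grams) with a brevity penalty, so this is only the core idea, not the full algorithm; the function name and example sentences are our own.

```python
from collections import Counter

def unigram_precision(mt: str, reference: str) -> float:
    """Toy, one-directional precision in the spirit of BLEU:
    the share of MT words that also appear in the reference.
    (Real BLEU combines clipped n-gram precisions up to n=4
    with a brevity penalty; this shows only the core idea.)"""
    mt_words = mt.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = 0
    for word in mt_words:
        if ref_counts[word] > 0:
            matched += 1
            ref_counts[word] -= 1  # "clip": each reference word can match only once
    return matched / len(mt_words) if mt_words else 0.0

print(unigram_precision("the cat is barking", "the dog is barking"))  # 0.75
```

Note the one-sidedness: the score only asks how much of the MT output is found in the reference, not how much of the reference the MT output managed to cover.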
Then there is a metric called METEOR. METEOR’s algorithm is more nuanced because not only does it compare MT and the human reference in both directions, it also takes into account things like linguistics. While BLEU checks existing words exactly as they appear, METEOR considers some linguistic variants. In English, “ride” or “riding” would count as two different words for the BLEU score. But for METEOR, it would count as a single word because they have the same root.
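The stem-matching idea can be sketched very simply. The toy stemmer below is our own crude suffix stripping, purely for illustration; METEOR itself uses a proper stemmer (such as Porter’s) along with synonym and paraphrase matching.

```python
def toy_stem(word: str) -> str:
    """Very rough suffix stripping, just to illustrate stem matching.
    (Real METEOR uses a proper stemmer such as Porter's, plus
    synonym and paraphrase tables.)"""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Exact comparison (BLEU-style) vs. stem comparison (METEOR-style):
print("ride" == "riding")                      # False: BLEU sees two different words
print(toy_stem("ride") == toy_stem("riding"))  # True: both reduce to the same stem
```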
That’s why we generally use METEOR in more cases than we use BLEU. These nuances can affect the accuracy of the quality measurement.
So, BLEU and METEOR measure the difference between MT and human outputs. What about other metrics? Do they measure the same thing?
There is actually another purpose for automated metrics, and that’s to measure the effort of post-editors—the humans who review and revise MT translations to fix inaccuracies. We measure the difference between the MT output and the post-edited output in terms of the number of changes, which could include deleting, replacing and adding words. A formula would calculate the number of these edits and give a numerical result.
What metrics do we use to measure effort?
We use two kinds of metrics for this, too. One is called the Levenshtein distance, which calculates the difference between the MT output and the post-edited translation. It shows what the post-editor did to the original MT output. Let’s say the machine translation output is “the cat is barking,” and a post-editor changes this to “the dog is barking.” In this counting, the difference would be six, because you include the three letters deleted and the three letters added when editing from “cat” to “dog.” (The classic Levenshtein formulation charges a substitution as a single edit, which would give three; counting each replacement as a deletion plus an insertion gives six.) Then you divide six by the number of letters in the whole segment to come up with a result that is a percentage.
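Here is a short sketch of the character-level calculation just described. To match the counting in the example, a replaced character is charged as a deletion plus an insertion (cost 2); the classic Levenshtein formulation would charge it as a single edit instead. The function name is our own.

```python
def char_edit_distance(a: str, b: str) -> int:
    """Character-level edit distance where replacing a character
    counts as a deletion plus an insertion (cost 2), matching the
    counting in the cat/dog example. Classic Levenshtein would
    charge a substitution as a single edit."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 2  # replacement = delete + insert
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # match or replacement
    return d[m][n]

mt = "the cat is barking"
post_edited = "the dog is barking"
edits = char_edit_distance(mt, post_edited)
print(edits, round(edits / len(mt), 2))  # 6 edits over 18 characters, about 33%
```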
The second metric we use to measure the human effort of post-editing is the TER score. Whereas the Levenshtein distance counts on a character level—which characters are deleted, added or replaced—the TER score tries to account for the kinds of changes made and makes a calculation based on the number of edits rather than the number of character changes.
Again, take the example “the cat is barking” and “the dog is barking.” The Levenshtein distance counts both the three letters deleted and the three letters added. When you calculate TER, it recognizes a single replacement: one string is replaced with another. That string has a length of three. So, it calculates a single edit with a length of three characters.
Therefore, Levenshtein can actually overestimate the effort of making long edits that are in fact only single edits—for example, if you replace one or two characters here and there throughout a long sentence. Levenshtein wouldn’t be able to tell the difference in effort between that and overwriting full words. In this case, TER is more reliable because its logic is closer to the actual post-editing effort.
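To make the contrast concrete, here is a minimal word-level sketch in the spirit of TER. Real TER also allows block shifts of whole phrases and uses its own tokenization; this simplified version, with a name of our own choosing, counts only word-level insertions, deletions and substitutions, normalized by the length of the edited translation.

```python
def simple_ter(mt: str, post_edited: str) -> float:
    """Simplified TER: word-level edit distance (insertions,
    deletions and substitutions each cost 1) divided by the
    number of words in the post-edited translation. Full TER
    also allows block shifts of phrases; those are omitted."""
    hyp, ref = mt.split(), post_edited.split()
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n] / n if n else 0.0

# One word replaced out of four: a single edit, regardless of word length.
print(simple_ter("the cat is barking", "the dog is barking"))  # 0.25
```

Because the unit is a word rather than a character, overwriting “cat” with “dog” and overwriting it with “rhinoceros” cost the same, which is much closer to the actual post-editing effort.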
How do all these automated metrics determine the quality of MT?
Well, the thing about automated evaluations is that they try to imitate the outcomes of human evaluations. But in the end, automated evaluations can only show the percentage that represents the difference between MT output and human or post-edited translations.
A human evaluation, by contrast, can be a lot more granular; humans can give a more detailed overview of MT quality. We usually use the TAUS DQF benchmark to guide human evaluations, and in doing so, we get a better understanding of the different aspects of language quality, like accuracy (how well the message is conveyed) and fluency (spelling and grammar), whereas the single figure returned by automated metrics mainly reflects accuracy.
Fluency is a lot more difficult to measure because linguistically it’s subjective. But we could make automated metrics more sensitive to fluency by developing them to examine groups of words together, or so-called n-grams (where “n” stands for the number of consecutive words). The theory is, the larger the groups of words that appear in the same order in both MT and human translations, the more fluent the MT output.
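The n-gram idea can be sketched in a few lines. This toy example (our own, not a production metric) extracts the n-grams of two sentences and shows how the overlap shrinks as n grows: the longer the shared n-grams, the longer the stretches of MT output that read exactly like the human translation.

```python
def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return the list of n-grams (tuples of n consecutive words)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

mt = "the cat is barking"
ref = "the dog is barking"
for n in (1, 2, 3):
    overlap = set(ngrams(mt, n)) & set(ngrams(ref, n))
    print(n, sorted(overlap))
# Three shared unigrams, one shared bigram ("is barking"), no shared trigrams:
# the single differing word breaks up every longer sequence.
```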
Do you have any parting thoughts on the topic of MT measurement?
If I had one wish, it would be to develop one standard automated metric that is reproducible, meaning the same algorithm gets you the same numerical result for the same content, and ideally is agile enough to track results on a long-term basis. This would help us compare the performance of MT engines more reliably.
When we talk about technologies that use machine learning, bias is always a key question. MT is no different—all the metrics rely on some human input, whether it’s a golden reference translation or a post-edited translation, which varies from person to person. Therefore, comparing results for different languages has its limitations. So does comparing results over time.
Let’s say we measured the quality of a client’s MT engine with sample content that was considered new at the time. But as the engine continues to support their translation process, the client’s business evolves to include new products or features, thus newer content. Can we still say the MT engine is as good as we measured earlier? Or has it degraded? If we train a new engine on new translations, can we say the new engine is better based on evaluations using the original sample? That sample is already somewhat old—in other words, biased to the time period in which it was created. These are questions our experts cope with daily.
Ultimately, the questions we want answered by automated metrics may sound more straightforward than they really are: “Which is the better MT engine?” or “Is the engine good enough to make post-editing more efficient than translating from scratch?” And if so, “How much more efficient?” Given the flaws in the metrics currently available, we need to take a critical approach to interpreting results in their context. That’s why it would be great if we could better standardize automated MT measurement while accounting for our clients’ evolving needs.
Of course, automated metrics will continue to influence decisions about using one engine over another, but it’s still up to humans to make the final call. And that in itself is a tricky business because human evaluation involves multiple methodologies.
In the meantime, humans still have an important job to do: making sense of these metrics and their role in comparing MT systems. The field of MT evaluation is evolving as fast as the systems themselves, but your language services provider should be able to advise you on the best use of automated metrics and human evaluation methodologies for your languages, content types and use cases.
Does your localization team favor one metric over another? Why? Let us know in the comments below or contact us to talk machine translation strategy.