Understanding machine translation quality: What really matters?

Kirti Vashee Independent Language Technology Consultant 23 Apr 2019

This is the second post in our blog series on machine translation quality. The first one focused on BLEU scores, which are often improperly used to make quality-based decisions for which BLEU is clearly not the best metric.

The use of machine translation (MT) in the translation industry has historically been heavily focused on localization use cases, with the primary aim of improving efficiency, that is, speeding up turnaround and reducing word cost. Indeed, machine translation post-editing (MTPE) has been instrumental in helping localization workflows achieve higher levels of productivity.

Many users in the localization services industry select their MT technology based on two primary criteria:

Lowest cost
“Best quality” assessments based on metrics like BLEU, LEPOR or TER, usually done by a third party

The most common way to assess the quality of an MT system's output is to use a string-matching algorithm score like BLEU. As we pointed out previously, equating a string-match score with the potential future translation quality of a system in a new domain is unwise and quite likely to lead to disappointing results. BLEU and other string-matching scores offer the most value to research teams building and testing MT systems.
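To illustrate what "string matching" means in practice, here is a minimal sketch of clipped n-gram precision, the building block of BLEU-style metrics. It is a deliberate simplification, not full BLEU (there is no brevity penalty and no averaging over n-gram orders), and the function names and example sentences are assumptions made up for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU-style metrics).
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical sentences, invented for this sketch.
reference = "the device must be restarted after the update"
literal = "the device must be restarted after the update"
paraphrase = "please reboot the unit once the update finishes"

for name, text in [("literal", literal), ("paraphrase", paraphrase)]:
    p1 = clipped_precision(text, reference, 1)
    p2 = clipped_precision(text, reference, 2)
    print(f"{name}: unigram={p1:.3f} bigram={p2:.3f}")
```

The paraphrase conveys the same instruction as the reference, yet it shares only a few word sequences with it and therefore scores far lower than the literal match. This is why a string-match score says little about how useful a translation will be in a new domain.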

Many users consider only the results of comparative evaluations – often performed by means of questionable protocols and processes using test data that is invisible or not properly defined – to select which MT systems to adopt. Most frequently, such analyses produce a score table like the one shown below, which might lead users to believe they are using the “best-of-breed” MT solution since they selected the “top” vendor (the first row in each column).

English to French	English to Chinese	English to Dutch
Vendor A – 46.5	Vendor C – 36.9	Vendor B – 39.5
Vendor B – 43.2	Vendor A – 34.5	Vendor C – 37.7
Vendor C – 42.5	Vendor B – 32.7	Vendor A – 32.5

While this approach looks logical at one level, it often introduces errors and undermines efficiency because of the administrative inconsistency between different MT systems.
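As a rough sketch of where the "pick the top vendor" approach leads, the snippet below applies it to the scores in the table above. The dictionary layout and the helper name `top_vendor` are assumptions for illustration only.

```python
# Scores copied from the table above, keyed by language pair.
scores = {
    "English to French": {"Vendor A": 46.5, "Vendor B": 43.2, "Vendor C": 42.5},
    "English to Chinese": {"Vendor C": 36.9, "Vendor A": 34.5, "Vendor B": 32.7},
    "English to Dutch": {"Vendor B": 39.5, "Vendor C": 37.7, "Vendor A": 32.5},
}


def top_vendor(pair_scores):
    """Return the vendor with the highest score for one language pair."""
    return max(pair_scores, key=pair_scores.get)


winners = {pair: top_vendor(s) for pair, s in scores.items()}

# Margin between the first- and second-ranked vendor for each pair.
margins = {}
for pair, s in scores.items():
    ranked = sorted(s.values(), reverse=True)
    margins[pair] = round(ranked[0] - ranked[1], 1)

for pair in scores:
    print(f"{pair}: {winners[pair]} (lead of {margins[pair]} points)")

print("Distinct vendors to manage:", len(set(winners.values())))
```

Selecting the top score per language pair means adopting three different vendors for three language pairs, each with its own administration, integration and tuning process, on the strength of leads of only a few points on a single test set.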

Assessing business value and impact

The first post in this blog series exposed many of the fallacies of automated metrics that use string-matching algorithms (like BLEU and LEPOR). These are not reliable machine translation quality assessment techniques, because they only reflect the calculated precision and recall of text matches in a single test set, on material that is usually unrelated to the enterprise domain of interest. The issues discussed there challenge the notion that single-point scores can really tell you enough about long-term MT quality implications.

The enterprise value equation is much more complex and goes far beyond linguistic quality and Natural Language Processing (NLP) scores. To truly reflect business value and impact, evaluation of MT technology must factor in non-linguistic attributes including:

Adaptability to business use cases
Manageability
Integration into enterprise infrastructure
Deployment flexibility

To effectively link MT output to business value implications, we need to understand that although linguistic precision is an important factor, it often has a lower priority in high-value business use cases. This view will hopefully take hold as the purpose and use of MT is better understood in the context of a larger business impact scenario, beyond localization.

But what would more dynamic and informed approaches look like? MT evaluation certainly cannot be static since systems must evolve as requirements change. Instead of a single-point score, we would ideally have a richer framework that still delivers an easy, single measure telling us everything we need to know about an MT system. Unfortunately, this is not yet feasible today.

A more meaningful evaluation framework

While single-point scores do provide a quick and dirty sense of an MT system’s performance, it is more useful to focus testing efforts on specific enterprise use case requirements. This is also true for automated metrics, which means that scores based on news-domain tests should be viewed with care since they are not likely to be representative of performance on specialized enterprise content.

When rating different MT systems, it is essential to score key requirements for enterprise use, including:

Adaptability: Range of options and controls available to tune the MT system performance for very specific use cases. For example, optimization techniques applied to eCommerce catalogue content should be very different from those applied to technical support chatbot content or multilingual corporate email systems.
Data privacy and security: If an MT system will be used to translate confidential emails or business strategy and tactics documents, the privacy and security requirements will differ greatly from those of a system that only handles product documentation. Some systems will harvest data for machine learning purposes, and it is important to understand this upfront.
Deployment flexibility: Some MT systems need to be deployed on premises to meet legal requirements, as is the case in litigation scenarios or when handling high-security data.
Expert services: Having highly qualified experts to assist in the MT system tuning and customization can be critical for certain customers to develop ideal systems.
IT integration: Increasingly, MT systems are embedded in larger business workflows to enable greater multilingual capabilities, for example, in communication and collaboration software infrastructures like email, chat and CMS systems.
Overall flexibility: Together, all these elements provide flexibility to tune the MT technology to specific use cases and develop successful solutions.
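One way to score these requirements is a weighted scorecard, sketched below. Everything numeric here is a hypothetical assumption: the two systems ("System X" and "System Y"), the 1 to 5 ratings, and the per-use-case weights are invented for illustration and do not come from any real evaluation. Only the criteria names are taken from the list above.

```python
# Criteria taken from the list above.
CRITERIA = [
    "adaptability",
    "data_privacy_security",
    "deployment_flexibility",
    "expert_services",
    "it_integration",
]

# Hypothetical weights per use case; each profile sums to 1.0.
WEIGHTS = {
    "confidential_email": {
        "adaptability": 0.15, "data_privacy_security": 0.40,
        "deployment_flexibility": 0.25, "expert_services": 0.05,
        "it_integration": 0.15,
    },
    "product_documentation": {
        "adaptability": 0.40, "data_privacy_security": 0.05,
        "deployment_flexibility": 0.05, "expert_services": 0.25,
        "it_integration": 0.25,
    },
}

# Hypothetical ratings on a 1 (poor) to 5 (excellent) scale.
RATINGS = {
    "System X": {
        "adaptability": 5, "data_privacy_security": 2,
        "deployment_flexibility": 2, "expert_services": 4,
        "it_integration": 4,
    },
    "System Y": {
        "adaptability": 3, "data_privacy_security": 5,
        "deployment_flexibility": 5, "expert_services": 3,
        "it_integration": 3,
    },
}


def weighted_score(system, use_case):
    """Weighted sum of a system's ratings under one use-case profile."""
    weights = WEIGHTS[use_case]
    ratings = RATINGS[system]
    return sum(weights[c] * ratings[c] for c in CRITERIA)


def best_system(use_case):
    """Return the system with the highest weighted score for a use case."""
    return max(RATINGS, key=lambda system: weighted_score(system, use_case))


for use_case in WEIGHTS:
    for system in RATINGS:
        print(f"{use_case} / {system}: {weighted_score(system, use_case):.2f}")
    print(f"  preferred for {use_case}: {best_system(use_case)}")
```

With these made-up numbers the preferred system flips between the two use cases. The resulting figure is only as good as the weights behind it, so the value of the exercise lies in making the use-case priorities explicit rather than in the final number.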

Ultimately, the most meaningful measures of MT success are directly linked to business outcomes and use cases. The definition of success varies by the use case, but most often, linguistic accuracy as an expression of translation quality is secondary to other measures of success.

The integrity of the overall solution likely has much more impact than the MT output quality in the traditional sense: not surprisingly, output quality could vary by as much as 15% on either side of the scale without impacting the real business outcome. In fact, there are reports of improvements in output quality in an eCommerce use case that actually reduced the conversion rates on the post-edited sections, as this content was viewed as being potentially advertising-driven and thus less authentic and trustworthy.

True expressions of successful business outcomes for different use cases

Global enterprise communication and collaboration

Increased volume in cross-language internal communication and knowledge sharing with safeguarded security and privacy
Better monitoring and understanding of global customers
Rapid resolution of global customer problems, measured by volume and degree of engagement
More active customer and partner communications and information sharing

Customer service and support

Higher volume of successful self-service across the globe
Easy and quick access to multilingual support content
Increased customer satisfaction across the globe
Ability of monolingual live agents to service global customers regardless of the originating customer’s language

eCommerce

Measurably increased traffic drawn by new language content
Successful conversions in all markets
Transactions driven by new translated content
Stickiness of new visitors in new language geographies

Social media analysis

Ability to identify key brand impressions
Easy identification of key themes and issues
Clear understanding of key positive and negative reactions

Localization

Faster turnaround for all MT-based projects
Lower production cost as a reflection of lower cost per word
Better MTPE experience based on post-editor ratings
Adaptability and continuous improvement of the MT system

In upcoming posts in this series, we will continue to explore the issue of MT quality assessment from a broad enterprise needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that leverage the core business mission to solve high-volume multilingual challenges more effectively.

If you'd like to find out more, please register for our upcoming webinar on April 25.

Kirti Vashee is an independent language technology consultant, specializing in machine translation and translation technology. He was also with Asia Online and was previously responsible for the worldwide business development and marketing strategy at statistical MT pioneer Language Weaver, prior to its acquisition by SDL. Kirti has long-term sales and marketing experience in the enterprise software industry, working both at large global companies (EMC, Legato, Dow Jones, Lotus, Chase) and several successful startups.
