The impact of LLMs on translation: a new impetus for evolving machine translation

Bart Maczynski 14 Nov 2023 7 mins

Continuous innovation

Earlier this year I wrote an article about the three waves of technological innovation that have shaped the translation industry. The three waves were, perhaps unsurprisingly, Translation Memory (TM), Translation Management Systems (TMS), and Neural Machine Translation (NMT). Each of these technologies introduced new efficiencies into the translation process: TM helped reduce the time and effort spent on translating repetitive text, TMS allowed for the centralization of translation assets and the optimization of workflows, and NMT provided first-draft translations so that net new content did not have to be translated from scratch.
 
One of the most interesting aspects of these three innovations is how interlocked they are. Translation Memory technology made translators so efficient – and the databases of approved translations so valuable – that, to fully coordinate task assignments for the former and optimize the leverage of the latter, a new solution category had to be introduced. And while TM created the space for the TMS, the TMS made TM better by amplifying its leverage across multiple teams, vendors, and content types. Similarly, Machine Translation made TM more valuable, since it can now be used not only for direct leverage but also as a source of high-quality training input for adaptive MT models.
 
Nevertheless, already in January when I wrote that article, it was clear that another tech wave was coming. A new type of natural language processing engine, the Large Language Model, was being introduced to a diverse, global community of users – for the first time not limited to specialists like NLP researchers or engineers. ChatGPT, released by OpenAI in November 2022, was the catalyst: it wrapped the GPT Large Language Models in a simple, intuitive interface built around the concept of a natural conversation, and thus made them accessible to anyone who can read and write. What has followed is a rapid proliferation of many different LLMs, such as Google Bard, Anthropic Claude, TII Falcon, and Meta's Llama.
 
What we are experiencing now is an accelerating AI revolution, driven primarily by the emergent capabilities of Large Language Models, leading to their seemingly universal application, and accompanied by the inevitable yet somewhat overwhelming hype. I don't have precise data, but it's safe to assume the current user base of ChatGPT is well over 200 million – across all industries and enterprise functions. On top of that, there are users of competing platforms like Google Bard or Bing Chat, and users of open-source LLMs.
 
With this scale of adoption – or perhaps we should say experimentation – it is not surprising that all sorts of ideas, use cases, hopes, and concerns can be heard. In almost any industry I can think of, AI has had its advocates and its opponents, sometimes representing extreme views, from "LLMs are the modern oracles of knowledge" to "LLMs are just stochastic parrots that generate convincing noise but cannot comprehend anything". Entertaining as these discussions may be, once we tune them out, the question that arises is that of practical, real-life applications, with requirements dictated by the specific industry and use case. How can this unfolding AI revolution be harnessed to evolve current solutions and achieve better outcomes?
 

Unintended consequences

In the translation industry, initial experiments with LLMs cover a wide spectrum of use cases, from generating domain-specific content, through terminology management, gender bias correction, TM data cleanup, and register or style adjustment, all the way to straight-up translation. For some experimenters, that last idea is the most appealing. After all, Large Language Models are so much larger than purpose-built MT models that they must perform better, right? Well, that type of blue-sky thinking is great, but it's not useful if we cannot, pardon my pun, translate it into practical change – and the unforeseen consequences can often lead to disasters.

I recall one such early example where a customer tried to replace a dedicated neural MT system with a Large Language Model, hoping to take advantage of its general knowledge and wide context window to achieve more relevant and consistent translations without human intervention. The customer shared some of their experiments with us and, upon closer examination, we discovered instances of what I would call sycophantic translation, where the model produced translated text that it assumed would be most acceptable, even if it did not accurately reflect the source text. In one such example, the source segment contained a reference to a product that belonged to our customer's brand. The LLM did not know how to translate that product name into the target language, so it chose a different strategy: it correctly identified the product category, picked a similar product from a different manufacturer (for which it did know the target-language name), and inserted it into the translated sentence. And thus, even though the model's general knowledge and wide context window played a crucial role in generating the translation, the outcome was not exactly the one our customer had anticipated or hoped for.

That whole adventure reminded me of a short story by Stanisław Lem that I had read in grade school. In the story (from Lem's Cyberiad collection), Trurl the engineer builds a machine that can create anything starting with the letter N. All is going well until his friend, Klapaucius, asks the machine to create Nothing, whereupon the situation very quickly accelerates towards apocalyptic outcomes.
 
I think the lesson for us is twofold. First, consider your use case carefully. Understand what new technologies such as LLMs are best at, and apply them where you can improve outcomes without undermining the entire process. Understand the challenge you are solving and define the current frontier – the place where existing solutions underdeliver. Second, if you are introducing the innovation into an established workflow, process, or platform, make sure you retain the same level of governance as you do for the rest of the solution. For LLMs, this has initially been the bigger of the two challenges. Enterprise-grade translation solutions require levels of security, controllability, reliability, and customization that are difficult to achieve if you are wrapping someone else's API into your own application, especially if at the core of the solution are models that you are unable to govern. And if the data that's expected to be processed by your solution is not yours but has been entrusted to you by your most valuable customers, the stakes are very high.
 

The current frontier

In the translation industry, the current frontier – where off-the-shelf solutions cannot quite deliver improvements yet – is the continuous, high-scale need for human intervention. Granted, in the last few decades there has been tremendous progress in this field, as I mentioned at the beginning. But the role of a professional linguist has now changed significantly: their tasks are focused on post-editing and reviewing machine output. This is why the leading voices in the industry call for a functional transition from the role of a translator to that of a language specialist.
 
As machine translation becomes more customizable to specific domains, content types, and use cases, the focus increasingly turns to two key tasks: first, identifying which sections of the translation may require refinement; second, leveraging this insight to concentrate effort on those areas and achieve the desired enhancements. If this observation resonates with you, the pressing challenge becomes: how can we evolve these two tasks into a more automated process? How can we build a technological solution that advances translation beyond its current limits?
 

Introducing Evolve

It is exactly this question that we asked ourselves a while ago at RWS. How can we build a system that incorporates the best of past and current innovations to automate the post-editing process and thus bring the new wave of innovation into the industry? The emergence of Large Language Models, and the research on language models in general, helps deliver the answer.
 
Consider two early language models: BERT and the original GPT. An interesting thing about them is that their specific neural architectures made them well suited for particular categories of tasks, and these early language models paved the way for the technologies available today. If you look at their names, you'll notice that both BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) have one element in common – the Transformer. Transformers are a type of neural network architecture designed to handle sequential data, like text or time series. Unlike previous models that processed data strictly in order, transformers can look at all parts of the input at once, which allows them to capture complex relationships and context, and to train much faster. They do that using a mechanism called "attention", which helps the model focus on different parts of the data depending on what it is currently trying to achieve, such as translating between languages, summarizing a paragraph, or generating text from a prompt. Transformers have revolutionized the field of natural language processing and are the backbone of many modern AI systems used in the field. In fact, while the first Neural Machine Translation systems were based on the Recurrent Neural Network (RNN) architecture, the current crop of NMT systems has increasingly relied on transformers since their introduction. The transformer model, with its attention mechanism and ability to process entire sequences simultaneously, has proven to be highly effective for the complexities of translation. This architecture allows NMT systems to better capture nuances and context, leading to more accurate and fluent translations across many languages.
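To make the attention idea slightly more concrete, here is a minimal, self-contained sketch of scaled dot-product attention, the core computation inside a transformer. It is purely illustrative – real models add learned projections, multiple attention heads, and masking – and it is not tied to any specific Language Weaver implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention.

    Q, K, V are (sequence_length, dimension) matrices of queries, keys,
    and values. Each output row is a weighted mix of the value rows,
    with the weights determined by how strongly the corresponding query
    matches every key – this is how the model "looks at" the whole
    sequence at once.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # blend the values

# Toy example: a 4-token sequence with 8-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # -> (4, 8)
```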
 
Nevertheless, while the original BERT and GPT models both utilized the transformer architecture, they also differed in significant ways. The E in BERT stands for encoder, while GPT is primarily a decoder-based architecture. In advanced language models, the encoder is a component that performs deep linguistic analysis. It examines the input text to discern its meaning, structure, and the relationships between words and phrases, effectively encoding the essence of the input into a complex, abstract representation. The decoder is the generative counterpart that interprets this abstract representation. It predicts the most likely sequence of words that follows, based on patterns it has learned during training. It doesn't just repeat what it's seen; it generates new content that's contextually and syntactically coherent.
 
While these two components can work in tandem, as in a sequence-to-sequence model used for tasks like translation, certain models specialize in one aspect. GPT is such a model, using only the decoder part for text generation tasks, whereas BERT utilizes the encoder part to understand and process input text for tasks that require a deep understanding of language, such as question answering, named entity recognition, or quality estimation.
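As a purely illustrative aside (this is not Language Weaver code), the encoder/decoder distinction is easy to see with the open-source Hugging Face transformers library: an encoder model like BERT turns input text into contextual representations you can analyze, while a decoder model like GPT-2 continues a prompt with newly generated text. The model names below are simply publicly available stand-ins.

```python
# Illustrative only: publicly available models used as stand-ins for the
# "encoder" and "decoder" architectures discussed above.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Encoder (BERT-style): analyze text into one contextual vector per token.
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
inputs = enc_tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
hidden_states = encoder(**inputs).last_hidden_state
print(hidden_states.shape)   # (batch, tokens, hidden size), e.g. (1, 12, 768)

# Decoder (GPT-style): generate a likely continuation of a prompt.
dec_tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = dec_tok("Machine translation post-editing is", return_tensors="pt")
output_ids = decoder.generate(**prompt, max_new_tokens=20)
print(dec_tok.decode(output_ids[0], skip_special_tokens=True))
```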
 
This leaves us with an exciting field of possibilities. We have encoder/decoder models, such as Neural MT engines; we have encoder models that can tell us things about the input text; and we have decoder models that can generate text. You can probably see where this is going – we can use three different architectures and optimize them for different tasks: one for translation, one for text analysis, and one for text generation. What if we put them all together, so that we can automatically translate the input text, then automatically detect the areas that need improvement, and then automatically rewrite the flagged sections to improve them?
 
This is precisely what we have done with our next-generation capability, Language Weaver Evolve. It combines three AI-driven technologies to address the challenge of MT post-editing. The three components are:
  1. Neural Machine Translation with auto-adaptive language pairs - this technology from Language Weaver has already proven itself on the market. It is optimized to provide relevant translations in a secure and scalable manner across the required language combinations. It is also capable of learning from external inputs on a continuous basis. These inputs may include translation memory data, bilingual dictionaries, and real-time feedback provided by post-editing. The Language Weaver language pairs can also be trained upfront by customers who have relevant bilingual content. 
  2. Machine Translation Quality Estimation (MTQE) – based on a language model, this engine is designed to automatically detect and flag lower-quality translations. Interestingly, in our implementation it can do that at both the document and the segment level, but for Evolve we focus on automatically marking each translated sentence as good, adequate, or bad, so that we know where to focus the improvement efforts.
  3. Finally, once we know what needs to be improved, the third component comes into play: an automated post-editing engine, based on a Large Language Model (LLM), which we host securely on the same infrastructure we use for our MT and MTQE services.
Instead of sending the poor and adequate sentences directly to human linguists, we give the machine a chance to improve them – and iterate the edits until we get a better score. The system re-runs the MTQE process after each automatic edit to test whether the translation has improved; a simplified sketch of this loop follows below.
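The flow can be sketched roughly as follows. This is a simplified illustration of the idea rather than Language Weaver's actual code: translate(), estimate_quality(), and llm_post_edit() are hypothetical stand-ins for the NMT, MTQE, and LLM post-editing services described above.

```python
# Hypothetical stand-ins for the NMT, MTQE, and LLM post-editing services;
# in a real deployment these would be calls to the respective services.
def translate(source: str) -> str: ...
def estimate_quality(source: str, translation: str) -> str: ...
def llm_post_edit(source: str, translation: str) -> str: ...

MAX_ITERATIONS = 3  # up to three edit passes; see the note on iterations below

def evolve_segment(source: str) -> tuple[str, str]:
    """Translate one segment and auto-improve it while MTQE still flags it."""
    translation = translate(source)                      # 1. NMT produces the draft
    label = estimate_quality(source, translation)        # 2. MTQE: "good" / "adequate" / "bad"

    iterations = 0
    while label != "good" and iterations < MAX_ITERATIONS:
        translation = llm_post_edit(source, translation) # 3. LLM-based automatic post-editing
        label = estimate_quality(source, translation)    #    re-score the edited draft
        iterations += 1

    # Anything still not rated "good" can be routed to a human linguist.
    return translation, label
```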
The approach we have taken with Evolve has a few interesting advantages:
  • All translation work is conducted using dedicated enterprise-grade NMT models that are optimized for high-quality, high-scale applications, while offering reasonable compute requirements and a low total cost of ownership. This technology has been successfully used by large user communities and is deployed across hundreds of commercial and public sector clients.
  • The quality estimation models have been calibrated on human-labeled examples produced by our in-house expert linguistic teams. This allows us to tune the performance of the model and extend coverage to new languages as needed.
  • The automated post-editing service utilizes a dedicated, smaller LLM hosted by RWS. This allows us to tune the LLM's performance, provide the highest levels of data security, and operate within a predictable cost structure. It also insulates the solution from any third-party API instabilities.
  • Building the solution from three separate modules – translation, quality estimation, and post-editing – allows us to tweak not only the individual components but also how they work together. For example, Language Weaver can now iterate the evaluate/edit loop several times until a desired outcome is achieved. When an edit task is completed, the translation is sent back for quality estimation; if the result is still found inadequate, the sentence is passed to the post-editing module again. This time, however, the system captures additional context from the source document and uses it to generate a better translation. (So far, our tests have shown that allowing up to three iterations provides the best compromise between quality, speed, and cost for most types of content.)
  • Evolve can be used in all use cases where traditional MT is used, because it does not change the ways in which translations are consumed by external systems and workflows. Crucially, in the localization use case, where some degree of human intervention may still be required (or mandated, as is the case for a lot of regulatory content), Evolve can seamlessly integrate into current workflows to alleviate the post-editing burden presently shouldered by human linguists.
  • Finally, since Language Weaver keeps track of all the automated edits and estimation results, the by-product of the translate/evaluate/edit sequence is a fantastic source of feedback for the translation engine. The auto-adaptive language pairs monitor the incoming edits and automatically update their models to reflect the observed improvements, as sketched below.
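To illustrate that last point, here is a rough, hypothetical sketch of how the records produced by the translate/evaluate/edit loop could be folded back into an auto-adaptive language pair. submit_adaptation_example() stands in for whatever feedback mechanism the adaptive engine exposes; it is not an actual Language Weaver API.

```python
# Hypothetical sketch: turning accepted automatic edits into adaptation feedback.
def submit_adaptation_example(source: str, translation: str) -> None: ...  # placeholder

def collect_feedback(records):
    """records: iterable of (source, final_translation, final_label) tuples
    produced by the translate/evaluate/edit loop sketched earlier."""
    for source, translation, label in records:
        if label == "good":
            # Each accepted segment becomes a fresh bilingual example that the
            # auto-adaptive NMT model can learn from incrementally.
            submit_adaptation_example(source, translation)
```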

You too can evolve with us

Optimizing the task of post-editing is a major opportunity for everyone involved in the translation process, from enterprise customers, through language service providers, to individual linguists. Using a combination of auto-adaptive MT and LLMs to minimize manual post-editing effort allows limited resources to be prioritized for high value-added activities. It also increases the usefulness of automated translation in use cases with minimal room for human intervention – or where time-to-market or time-to-insight is the primary driver – for example, the high-volume use cases related to legal eDiscovery, regulatory compliance, or digital forensics. For localization processes, the solution helps improve ROI through significant productivity gains. And for organizations that want to benefit from adaptable MT models but cannot, because they don't have enough previously translated material, Language Weaver Evolve is a great option to jump-start their translation process and initiate a virtuous improvement cycle.
 
Bart Maczynski
VP of Machine Learning, Solutions Consulting, RWS