Revving up the Right MT Engine for Your Program
In a recent blog post, we ran through what it takes to execute a machine translation program, from assessing what content works best for MT to deploying an MT engine.
But choosing the right engine for your projects is a part of the process that merits a more detailed discussion. Considering the many choices available on the market—dozens at last count—what features should you look for? How do you whittle down your options?
As we said in our last post, not all engines can handle all types of jobs. So, it’s important that your language service provider (LSP) works with you to clearly define your project requirements up front, especially quality expectations.
They should also ask you to provide a corpus of content that consists of source texts and their corresponding translations done by professional linguists. The source content becomes part of the test set that is run through an MT engine. Then, human-translated and machine-translated content is compared to determine the quality of the MT engine’s output.
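To make that concrete, here’s a minimal sketch of how a bilingual corpus might be split into training material and a held-out test set. The file name, tab-separated layout and 500-segment test size are illustrative assumptions, not a prescription.

```python
# Sketch: split a bilingual corpus (source<TAB>human translation per line)
# into training data and a held-out test set. File name, layout and the
# 500-segment test size are assumptions for illustration.
import csv
import random

with open("corpus.tsv", encoding="utf-8", newline="") as f:
    pairs = [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

random.seed(42)          # reproducible split
random.shuffle(pairs)

test_set, training_data = pairs[:500], pairs[500:]

# The source half of the test set is run through each candidate engine;
# the human half becomes the reference for comparing output quality.
test_sources = [src for src, _ in test_set]
test_references = [ref for _, ref in test_set]
```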
So, how do you figure out the best engines to try? Here’s how we at RWS Moravia approach the selection process.
1. Check what’s deployable
The first step in identifying which engines to try is checking for technical limitations: we look at which engines your translation management system (TMS) can actually support. (We skip this step if you don’t have a TMS.)
Beyond that, there are a few other things to consider. We look at features like:
- Engine type: The two major engine types these days are neural machine translation (NMT) and statistical machine translation (SMT). Research suggests that NMT engines, although pretty new to the scene, perform better than the best SMT engines—and are improving by the day. So, while SMT is still used in some cases, big players like Google and Microsoft are quickly rolling out NMT.
- Data privacy: Not all MT engine providers guarantee the security of your training data; some keep it as part of their own, which makes certain of our clients uncomfortable. Other providers, like Microsoft and Google, explicitly commit not to use your data for any purpose other than your own when you use their paid services.
- “Baseline” framework: Most MT engine providers offer a “baseline” engine: an out-of-the-box, generic engine that hasn’t yet been “trained” with your brand style or terminology; it’s built only on freely available data from the web. Other providers, like Globalese, offer a “clean-slate” solution with an engine that has to be trained with your specific content.
So, that means you have three options:
- Use a clean-slate engine if you want to start from scratch, which requires a lot of your own training data.
- Train a baseline engine with your content to add your brand style to generic language, which requires less training data. This is the most viable option and the one we prefer. Most of our clients have enough training data to adapt a baseline engine—which generally gets better results than using the baseline engine as-is—but not enough to start from scratch.
- Use a generic engine as-is, which requires no training data.
With all these considerations in mind, we’re able to narrow the choices down to five or six MT engines that best fit your needs. Now, it’s time to run the tests to see how the engines behave.
2. Run automated evaluations
There are two main evaluation methods for the quality of an MT engine’s output: automated and human. We always start with the former.
Among the most common automated evaluation methods, and one of the metrics we use, is the BLEU (bilingual evaluation understudy) score: an algorithm that measures how similar the machine translation is to a human reference translation. The higher the score, the better the engine’s quality and the closer its output is to human translation. Logically, BLEU scores tend to be higher when there’s a good amount of training data.
The maximum BLEU score is 100, but given that no two human translators translate the same way, a score of 75 is very high. Our benchmark is around 50: engines that produce lower scores are disqualified because they likely won’t improve translation efficiency. Those that pass with a score of 50 or higher can be tested further.
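To show what that scoring step can look like in practice, here’s a minimal sketch using the open-source sacrebleu library. The file names and the way the threshold is wired in are assumptions for illustration; only the ~50 benchmark comes from our process.

```python
# Sketch: score one engine's raw output against human reference
# translations with sacrebleu. File names are assumptions.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

references = read_lines("test_set.human.txt")     # professional translations
hypotheses = read_lines("test_set.engine_a.txt")  # raw MT output, line-aligned

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Engines scoring below our ~50 benchmark are disqualified at this stage.
if bleu.score >= 50:
    print("Engine moves on to further testing.")
```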
3. Train the engines
At this stage, we start training the engines to handle industry- or market-specific terminology using data from your translation memory (TM).
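Translation memories are typically exchanged as TMX files, so the training data often starts life as aligned segment pairs pulled from one. Here’s a rough sketch of that extraction; the file name and language codes are assumptions, and inline formatting tags inside segments are ignored for simplicity.

```python
# Sketch: extract aligned segment pairs from a TMX export of a
# translation memory for use as engine training data. File name and
# language codes are assumptions; inline tags inside <seg> are ignored.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_pairs(path, src_lang="en", tgt_lang="de"):
    pairs = []
    for tu in ET.parse(path).getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

training_pairs = tmx_pairs("translation_memory.tmx")
print(f"{len(training_pairs)} segment pairs available for training")
```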
What if you don’t have a TM—perhaps because you’re entering a new market?
In that case, we start with the best available generic (baseline) engine for that language. After some human post-editing (more on that below), the data can be used for training.
We’ve also started experimenting with new ways to get training data. For example, TAUS’s Matching Data service allows us to upload source content that the system will analyze, then download training data from TAUS’s database that’s relevant to your content sample and target language.
After training the engines, we run your source content through them again and use automated evaluations to see which ones come out on top. That should leave us with two or three MT engines that are ready for human evaluation.
4. Run human evaluations
Here’s where we use our expertise and prior experience to select the best engine (or combination of engines) for your content. First, we analyze the quality of the raw MT (the MT translations straight out of the engines that haven’t been reviewed or edited by humans).
Next, we perform post-editing (PE) and measure its effectiveness. (Post-editing is a different skill than translating.) At RWS Moravia, we use a proprietary tool to compare raw and post-edited machine translations using metrics like the number of edits or the time it took a post-editor to edit the raw MT. The lower the effort, the better the engine.
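The tool itself is proprietary, but the basic idea behind the “number of edits” metric can be sketched as a word-level edit distance between the raw MT and its post-edited version, normalized by length. The example strings below are purely illustrative.

```python
# Sketch: post-editing effort as a normalized word-level edit distance
# between raw MT output and its post-edited version. This illustrates the
# idea only; it is not RWS Moravia's proprietary tool.

def edit_distance(a, b):
    """Levenshtein distance over token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]

def pe_effort(raw_mt, post_edited):
    """Edits per post-edited word: lower means less effort, a better engine."""
    raw_tokens, pe_tokens = raw_mt.split(), post_edited.split()
    return edit_distance(raw_tokens, pe_tokens) / max(len(pe_tokens), 1)

raw = "The engine produce a translations of the sentence."
edited = "The engine produces a translation of the sentence."
print(f"Post-editing effort: {pe_effort(raw, edited):.2f}")  # 0.25
```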
At this point, after compiling results from our automated and human evaluations, one MT provider usually stands out above the rest.
5. Choose engine(s) and pilot them
You might choose more than one engine for different purposes or languages. For example, Google might work better for Chinese and Microsoft for French. One engine might be better suited to post-editing and another to publishing raw machine translation, if raw MT is enough for a particular content type.
Either way, once you’ve chosen your engine(s), we start officially training them by adding more training data to produce better results. This isn’t an exact science, since it depends on each client’s unique subject matter, content and language pairs. Modifying an engine through training data requires some trial and error.
Once the engine is ready, we can run a pilot translation project. Again, results can go either way. An engine that got a high BLEU score might end up inadequate for your specific purpose after all, meaning we need to switch MT providers or redo the training. But if the pilot works, great! We put the engine into production.
Final thoughts
There are plenty of choices available to you once you decide that machine translation suits your program. By working with an LSP, you can leverage their extensive experience to narrow down the list of contenders and identify the engines most likely to fit your content, scope and budget.
The “challenge” (though not for us MT geeks who love experimentation) is the process of trial and error. Even with a good idea of how an engine will perform, we need to track and monitor MT performance over time to be sure it actually delivers. There are no shortcuts to finding the best solution for you.
We know there’s a lot here to digest. RWS Moravia can guide you through every step of selecting and deploying your MT engine to get you the best results for your business and meet your customers’ needs worldwide.