Machine translation is ready for its next big steps
24 Oct 2022
6 mins
I had the pleasure of representing RWS at last week's TAUS meeting in San Jose, CA. It was good to be at the first in-person TAUS meeting since the pandemic! The community just naturally picked up where we left off, with all the friendships and the wonderful exchanges of ideas and opinions. Machine translation and AI continued to advance despite the pandemic, and there was some very exciting progress to review!
From an RWS perspective, the highlights of the conference fell into three main themes:
Massively multilingual is here. This was the official theme of the conference. We definitely see this within RWS as well; for example, the latest language chaining functionality has dramatically increased the number of available Language Weaver machine translation (MT) languages. A stellar group of MT gurus walked the conference participants through the evolution of MT away from bilingual models (language pairs) toward very large models with many languages, where the neural network calculates similarities across languages to derive translations. Some even see convergence into a single, massive model with thousands of languages. To achieve this, researchers are using very large data sets - training sets of over 25 billion samples are typical. The vision is that there could one day be one giant model that would serve any language in any style, trained on multiple tasks and with multiple data sources, to be ready for a variety of production tasks requiring artificial intelligence (AI).
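To make the shift from bilingual to multilingual models more concrete, here is a schematic sketch of one common design: instead of a separate model per language pair, a single shared model is told the desired output language with a tag prepended to the input. The "model" below is a toy lookup table standing in for a neural network, and the tag format and phrases are illustrative assumptions, not any vendor's actual API.

```python
# Schematic sketch: one shared "model" serves many language pairs.
# A target-language tag (e.g. "<2fr>") prepended to the source text
# tells the model which language to produce, so a single parameter
# set replaces N*(N-1) bilingual models. The lookup table here is a
# stand-in for the neural network; the entries are illustrative only.

TOY_MODEL = {
    ("<2fr>", "Hello"): "Bonjour",
    ("<2de>", "Hello"): "Hallo",
    ("<2es>", "Hello"): "Hola",
}

def translate(text: str, tgt_lang: str) -> str:
    """Prepend a target-language tag, then run the shared model."""
    tag = f"<2{tgt_lang}>"
    return TOY_MODEL[(tag, text)]

print(translate("Hello", "fr"))  # Bonjour
print(translate("Hello", "de"))  # Hallo
```

The key point is that every translation direction flows through the same shared parameters, which is also what lets knowledge transfer from high-resource to low-resource languages, as discussed below.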
Multimodal is coming. We saw practical applications of speech-to-text progress in presentations on online meeting translation offerings, and amazingly, these are now feeling mainstream! Where the envelope is being pushed is "multimodal": incorporating text, speech and video together into translation models is already well advanced in research, both in academia and among the very large MT suppliers. We've seen pioneers exploring AI for sign language translation, but now researchers are also studying how data from vision, facial expressions, speech, and body movement can be added to text to help increase translation quality. There is an emerging consensus that including both language and other modalities, as well as cross-domain learning, in MT training will help move us closer to the day that AI can learn to speak, hear, see, and understand the way humans do. Even animal language is being considered in this light!
Low-resource languages benefit from massively multilingual models. We heard continued concerns about a relative lack of data for MT in low-resource languages. The good news is researchers are finding that, as you combine and consolidate separate models, it's not only languages with a lot of data that benefit - low-volume languages start to benefit as well. Meta made a persuasive case for their No Language Left Behind vision of open-source data sets for MT training, which would encourage collaboration and speed progress for all languages, including low-resource ones. A worry expressed about the open-source data model is that it incorporates societal biases by default, thereby aggravating growing social gaps. There is awareness of the problem, and an intent to do something to minimize it - at this point mostly via human review and annotation.
Finally, we heard that along with the importance of data for training machine translation, human feedback on MT quality will remain key for improving MT models. MT customization remains important in enterprises, and it could become widespread across all applications if we can find ways to make it an easier and more automated process. At RWS we agree, and our support for customers taking charge of their own customization via adaptable language pairs is part of this effort.
RWS translation management systems easily integrate with a variety of machine translation engines, including out-of-the-box availability with Language Weaver MT. The RWS R&D team is the recipient of over 50 patents, and is at the forefront of government and commercial usage of highly secure, highly customizable MT, in the cloud and on-premises. Click here to learn more.