Issue #19 - Adaptive Neural MT

Dr. Patrik Lambert 29 Nov 2018
This blog post is about domain adaptation.

Neural Machine Translation is known to be particularly poor at translating out-of-domain data. That is, an engine trained on generic data will be much worse at translating medical documents than an engine trained on medical data. Neural MT is much more sensitive to such domain mismatch than, say, Statistical MT. This problem is partially solved by domain adaptation techniques, which we covered in Issue #9 of this series. However, what if we are in a multi-domain scenario and do not know the nature of the input in advance?

In this post, we take a look at a technique proposed by Amin Farajian et al. (2017), which consists of adapting the model on-the-fly for each source sentence, using similar sentence pairs retrieved from the training corpus. The technique also works very well as micro-adaptation in a closed-domain scenario.

Adaptive adaptation

A standard domain adaptation method for Neural MT models consists of training a model on a large amount of generic data and then resuming the training on in-domain data, which is usually available in a much smaller amount. Farajian et al. apply this technique by updating the model on-the-fly for each source sentence, using a few (or even only one) in-domain sentence pairs. These sentence pairs are retrieved from the training corpus with information retrieval techniques, based on the similarity between their source side and the sentence to translate.
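The retrieval step can be illustrated with a minimal sketch. It uses a simple token-overlap (Jaccard) score as a stand-in similarity measure, which is an assumption made here for illustration and not necessarily the measure used in the paper. The function names `token_similarity` and `retrieve_similar`, and the toy corpus, are hypothetical.

```python
from typing import List, Tuple


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two sentences (0.0 to 1.0).

    Stand-in similarity measure, chosen only for illustration.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve_similar(
    source: str,
    corpus: List[Tuple[str, str]],
    top_k: int = 1,
) -> List[Tuple[float, str, str]]:
    """Return the top_k (similarity, source, target) triples from the corpus.

    Similarity is computed between the input sentence and the SOURCE side
    of each training pair; the target side is only carried along.
    """
    scored = [(token_similarity(source, src), src, tgt) for src, tgt in corpus]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy training corpus of (source, target) pairs.
toy_corpus = [
    ("click the save button", "cliquez sur le bouton enregistrer"),
    ("the patient has a fever", "le patient a de la fievre"),
]

best = retrieve_similar("click the cancel button", toy_corpus, top_k=1)
```

Here `best` holds the IT-style sentence pair, since it shares three of five distinct tokens with the input. A real system would use an indexed search over the corpus rather than this linear scan.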

The authors argue that the higher the similarity, the more useful the retrieved sentence pairs are for adaptation. Conversely, a retrieved sentence pair that is very different from the source sentence cannot be considered in-domain and is not expected to be useful for adaptation. To benefit from this, the amplitude of the model update steps, controlled by hyper-parameters such as the number of iterations and the learning rate, is adjusted dynamically for each sentence. The parameter values for each similarity range were determined from plots of the BLEU score gain versus each of these parameters. The results confirmed the intuition that the learning rate and the number of iterations should be proportional to the similarity.
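As a rough sketch of this dynamic setting, the function below scales the learning rate and the number of iterations linearly with the similarity score. All numeric values (`max_lr`, `max_iters`, `min_sim`) and the cut-off below which no adaptation is done are hypothetical choices for illustration; the paper determines its values empirically per similarity range.

```python
from typing import Tuple


def adaptation_hyperparams(
    similarity: float,
    max_lr: float = 0.01,   # hypothetical learning rate at similarity 1.0
    max_iters: int = 10,    # hypothetical iteration count at similarity 1.0
    min_sim: float = 0.5,   # hypothetical threshold below which we skip adaptation
) -> Tuple[float, int]:
    """Return (learning_rate, num_iterations) for one source sentence.

    Both values grow proportionally with the similarity of the retrieved
    sentence pair. Below min_sim the retrieved pair is treated as
    out-of-domain and the model is left unchanged.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    if similarity < min_sim:
        return 0.0, 0
    learning_rate = max_lr * similarity
    num_iterations = max(1, round(max_iters * similarity))
    return learning_rate, num_iterations


lr_high, iters_high = adaptation_hyperparams(1.0)   # near-identical match
lr_mid, iters_mid = adaptation_hyperparams(0.6)     # partial match
lr_low, iters_low = adaptation_hyperparams(0.2)     # too dissimilar, no update
```

A near-duplicate of the input therefore triggers the strongest update, while a weak match leaves the generic model untouched.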

Promising initial results

This technique allows us to perform adaptation to any domain, provided that sentence pairs similar to the source sentence can be found in the training data. It thus yields small improvements on generic tasks, in which sentences are usually not very similar, but huge gains in specific domains with repeated constructions. In the paper, Farajian et al. report an improvement of 0.3 BLEU points over the non-adapted model on the generic WMT task (mostly news), and of more than 10 BLEU points on specific tasks such as ECB (financial), JRC (legal) or KDE4 (IT). In most domains, the on-the-fly technique also outperforms the model adapted with all available in-domain data.

The main drawback of the method is a reduction of translation speed, especially on CPU. The retrieval of similar sentence pairs in the training corpus is fast, but adapting the Neural MT model on CPU can take up to 1 or 2 seconds for each sentence.

In summary

The on-the-fly adaptation technique with dynamic hyper-parameter setting looks like a potentially effective solution in a multi-domain scenario. In this work, we saw impressive results in technical domains with high similarity between sentences, when using publicly available "in-domain" data sets. It will be interesting to see how this fares on real-world use cases where domain adaptation is required.

Dr. Patrik Lambert

Senior Machine Translation Scientist
Patrik conducts research on and builds high-quality customized machine translation engines, proposes and develops improved approaches to the company's machine translation software, and provides support to other team members.
He received a master's degree in Physics from McGill University, then worked for several years as a technical translator and as a software developer. In 2008 he completed a PhD in Artificial Intelligence at the Polytechnic University of Catalonia (UPC, Spain). He then worked as a research associate on machine translation and cross-lingual sentiment analysis.