Issue #19 - Adaptive Neural MT
Neural Machine Translation is known to be particularly poor at translating out-of-domain data. That is, an engine trained on generic data will be much worse at translating medical documents than an engine trained on medical data. Neural MT is much more sensitive to such domain mismatch than, say, Statistical MT. This problem is partially solved by domain adaptation techniques, which we covered in Issue #9 of this series. However, what if we are in a multi-domain scenario and do not know the nature of the input in advance?
In this post, we take a look at a technique proposed by Amin Farajian et al. (2017), which consists of adapting the model on-the-fly, for each source sentence, using similar sentence pairs retrieved from the training corpus. The same idea also works very well as micro-adaptation in a closed-domain scenario.
Adaptive adaptation
A standard domain adaptation method for Neural MT models consists of training a model on a large amount of generic data and resuming the training on in-domain data, usually available in much smaller quantities. Farajian et al. apply this technique at the sentence level, updating the model on-the-fly for each source sentence with a few in-domain sentence pairs (or even a single one). These sentence pairs are retrieved from the training corpus with information retrieval techniques, based on the similarity of their source side to the input sentence.
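To make the procedure concrete, here is a minimal Python sketch of the per-sentence loop. It is an illustration under loose assumptions, not the authors' implementation: `difflib` stands in for a real information retrieval engine, and the `model` object, with its hypothetical `save_weights`, `finetune`, `translate` and `restore_weights` methods, is a stand-in for an actual NMT toolkit API.

```python
import difflib

def retrieve_similar(source, corpus, threshold=0.3, max_pairs=5):
    """Return (similarity, src, tgt) tuples whose source side resembles
    `source`. A character-level difflib ratio stands in for the
    information retrieval engine used in the paper; `corpus` is a
    list of (src, tgt) training pairs."""
    scored = [(difflib.SequenceMatcher(None, source, src).ratio(), src, tgt)
              for src, tgt in corpus]
    scored = [item for item in scored if item[0] >= threshold]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:max_pairs]

def translate_adaptive(model, source, corpus):
    """Adapt on the retrieved pairs, translate, then restore the generic
    weights so the next sentence starts from the same baseline model."""
    pairs = retrieve_similar(source, corpus)
    if not pairs:
        return model.translate(source)   # no match: translate as-is
    snapshot = model.save_weights()      # hypothetical API: keep a copy
    model.finetune([(src, tgt) for _, src, tgt in pairs])
    translation = model.translate(source)
    model.restore_weights(snapshot)      # discard the per-sentence update
    return translation
```

The key design point is the restore step: the per-sentence update is thrown away after translating, so adaptation never accumulates and each sentence is served by the same generic model plus its own tailored update.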
The authors argue that the higher the similarity, the more useful the retrieved sentence pairs are for adaptation. Conversely, a retrieved sentence pair that is very different from the input sentence cannot be considered in-domain and is not expected to help. To exploit this, the magnitude of the model update, controlled by hyper-parameters such as the learning rate and the number of iterations, is adjusted dynamically for each sentence. The parameter values for each similarity range were determined from plots of the BLEU gain against each of these parameters, and the results confirmed the intuition: both the learning rate and the number of iterations should increase with the similarity.
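The similarity-dependent schedule could look like the sketch below. The thresholds and values are invented for illustration; the paper derives its actual settings empirically from the BLEU-gain plots.

```python
def adaptation_params(similarity):
    """Map retrieval similarity to update hyper-parameters: the stronger
    the match, the larger and longer the update. Thresholds and values
    are illustrative placeholders, not the tuned values from the paper."""
    if similarity >= 0.8:      # near-duplicate of the input sentence
        return {"learning_rate": 0.5, "epochs": 9}
    if similarity >= 0.5:      # clearly related sentence pair
        return {"learning_rate": 0.1, "epochs": 3}
    if similarity >= 0.3:      # loosely related sentence pair
        return {"learning_rate": 0.01, "epochs": 1}
    return None                # too dissimilar: skip adaptation entirely
```

In the `translate_adaptive` sketch above, these values would be passed to `model.finetune`, with a `None` result meaning the sentence is translated by the unadapted generic model.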
Promising initial results
This technique allows us to perform adaptation to any domain, provided that sentence pairs similar to the input sentence can be found in the training data. It thus yields small improvements on generic tasks, where sentences are usually not very similar, but large gains in specific domains with repeated constructions. In the paper, Farajian et al. report an improvement of 0.3 BLEU points over the non-adapted model on the generic WMT task (mostly news), and more than 10 BLEU points on specific tasks such as ECB (financial), JRC (legal) or KDE4 (IT). In most domains, the on-the-fly technique even outperforms the model adapted with all the available in-domain data.
The main drawback of the method is reduced translation speed, especially on CPU. Retrieving similar sentence pairs from the training corpus is fast, but adapting the Neural MT model on CPU can take one to two seconds per sentence.