Issue #9 - Domain Adaptation for Neural MT

Raj Patel 13 Sep 2018
The topic of this blog post is domain adaptation.

Introduction

While Neural MT has raised the bar in terms of the quality of general-purpose machine translation, it is still limited when it comes to more intricate or technical use cases. That is where domain adaptation, the process of developing and adapting MT for specific industries, content types, and use cases, has a big part to play.

In this post, we take a look at some of the commonly used techniques for domain adaptation of Neural Machine Translation, and summarise the survey of Chu and Wang (2017), who cover this topic in great detail.

There are many studies of domain adaptation for Neural MT, and they can be broadly divided into two categories: data centric and model centric. The data centric category focuses on the data being used rather than on specialised models for the required domain; that data can be in-domain monolingual corpora, synthetic corpora, or parallel corpora. The model centric category, on the other hand, focuses on Neural MT models specialised for domain adaptation, whether through the training objective, the Neural MT architecture, or the decoding algorithm. Let’s take a closer look at each category.

Data Centric

There is a lot of research suggesting that in-domain monolingual data can be used to improve Neural MT engines, especially for low-resourced languages. Gulcehre et al. (2015) trained a Recurrent Neural Network Language Model (RNNLM) on monolingual data and fused it into the Neural MT decoder to generate the final translation. Currey et al. (2017) created a bitext corpus from monolingual data (by copying the target text to the source side) and used it for training just like a normal parallel corpus. They reported that this “copied corpus” method improves translation accuracy on named entities and other words that should remain identical between the source and target languages. This point is particularly relevant for commercial translation, where companies need to adhere to certain stylistic guidelines.
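
To make the “copied corpus” idea concrete, here is a minimal Python sketch. The file names (mono.de, copied.src, copied.tgt) are purely illustrative assumptions, and this is a sketch of the idea rather than the authors’ released code.

    # Minimal sketch of the "copied corpus" method: each target-language
    # monolingual sentence is duplicated onto the source side, producing a
    # synthetic bitext that is concatenated with the real parallel data.
    def build_copied_corpus(mono_tgt_path, out_src_path, out_tgt_path):
        with open(mono_tgt_path, encoding="utf-8") as mono, \
             open(out_src_path, "w", encoding="utf-8") as src_out, \
             open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
            for line in mono:
                sentence = line.strip()
                src_out.write(sentence + "\n")  # source side = copy of the target
                tgt_out.write(sentence + "\n")  # target side = original sentence

    # Illustrative file names; the output is simply appended to the genuine
    # parallel training corpus before (re)training the NMT model.
    build_copied_corpus("mono.de", "copied.src", "copied.tgt")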

It has been shown that synthetic data generation is also very effective for domain adaptation. Sennrich et al. (2016) back-translated in-domain target-side monolingual data and used it to fine-tune the generic Neural MT model. Zhang and Zong (2016) improved the NMT encoder by translating source-side monolingual data and using it as an additional training corpus. Similarly, Park et al. (2017) reported comparable accuracy (against a system trained on an in-domain parallel corpus) using only synthetic data, generated with pivot-based MT from both source- and target-side monolingual corpora.
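
As a rough illustration of back-translation, the sketch below assumes a reverse (target-to-source) model exposing a translate() method; that interface is hypothetical, not a specific toolkit API.

    # Back-translation sketch: in-domain target monolingual data is translated
    # back into the source language by a reverse model, yielding synthetic
    # (source, target) pairs for fine-tuning the generic NMT engine.
    def back_translate(mono_tgt_sentences, reverse_model):
        synthetic_pairs = []
        for tgt in mono_tgt_sentences:
            synthetic_src = reverse_model.translate(tgt)  # hypothetical target-to-source MT call
            synthetic_pairs.append((synthetic_src, tgt))  # machine-translated source, human-written target
        return synthetic_pairs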

Mixed-Domain Engines

If both in-domain and out-of-domain parallel corpora are available, it is ideal to have a mixed-domain MT engine that improves in-domain translation while keeping the quality of more general out-of-domain translation. Adaptation methods based on parallel corpora can be classified as ‘Multi-Domain’ and ‘Data Selection’. In Multi-Domain methods, the corpora of multiple domains are concatenated with two small modifications (sketched in code after the list):
  1. Appending a domain tag “<2domain>” to the source sentences of the respective corpora.
  2. Over-sampling the smaller corpus so that the model pays equal attention to each domain.
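
The following Python sketch shows both modifications. The <2domain> tag format follows the description above, while the corpora, domain names, and sentence pairs are made up for illustration.

    import random

    def tag_corpus(pairs, domain):
        # Prepend a domain token such as "<2legal>" to every source sentence.
        return [(f"<2{domain}> {src}", tgt) for src, tgt in pairs]

    def oversample(pairs, target_size):
        # Repeat the smaller corpus (by random sampling) until it matches target_size.
        oversampled = list(pairs)
        while len(oversampled) < target_size:
            oversampled.append(random.choice(pairs))
        return oversampled

    # Illustrative mixed-domain training set: a large generic corpus plus a
    # small in-domain corpus, tagged and over-sampled to the same size.
    generic_pairs = [("a general sentence", "une phrase générale")] * 1000
    legal_pairs = [("a legal sentence", "une phrase juridique")] * 100
    mixed = tag_corpus(generic_pairs, "generic") + \
            oversample(tag_corpus(legal_pairs, "legal"), len(generic_pairs))
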
When using Data Selection methods for Neural MT, it is important to apply NMT-related criteria rather than criteria based on a language model or content overlap. Wang et al. (2017) exploit the internal embedding (the output of the encoder) of the source sentence in Neural MT, and use sentence embedding similarity to select, from the out-of-domain data, the sentences that are closest to the in-domain data.
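
A simplified version of this embedding-based selection might look like the sketch below, where encode() stands in for the NMT encoder producing a fixed-size sentence vector; that interface is an assumption for illustration, not Wang et al.’s actual implementation.

    import numpy as np

    def select_pseudo_in_domain(out_of_domain_sents, in_domain_sents, encode, top_k):
        # Centroid of the in-domain sentence embeddings.
        centroid = np.mean([encode(s) for s in in_domain_sents], axis=0)

        def cosine(v, w):
            return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

        # Rank out-of-domain sentences by similarity to the in-domain centroid
        # and keep the closest ones as pseudo in-domain training data.
        scored = [(cosine(encode(s), centroid), s) for s in out_of_domain_sents]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [s for _, s in scored[:top_k]]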

Model Centric

In Model Centric approaches, rather than focusing on the data, we focus on how the engine is trained. Training objectives or procedures are modified and tested to obtain an optimal in-domain MT engine. Fine-tuning is the most conventional way to adapt to a domain in this setting: a Neural MT engine is first trained on a large out-of-domain corpus until convergence, and its parameters are then fine-tuned on in-domain data. One risk with this approach is that we “overfit” the in-domain data (i.e. the engine becomes too focused on that specific data set and cannot generalise well).
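
As a sketch of how fine-tuning is typically run in practice (here in PyTorch-style code), assume model is the converged out-of-domain engine, in_domain_loader yields in-domain batches, the model call returns its training loss, and dev_loss_fn evaluates a held-out in-domain set; all of these names are assumptions. A reduced learning rate and early stopping are common safeguards against overfitting.

    import torch

    def fine_tune(model, in_domain_loader, dev_loss_fn, epochs=3, lr=1e-5):
        # Smaller learning rate than in the initial out-of-domain training.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        best_dev_loss = float("inf")
        for _ in range(epochs):
            model.train()
            for batch in in_domain_loader:
                optimizer.zero_grad()
                loss = model(**batch)          # assumed to return the training loss
                loss.backward()
                optimizer.step()
            dev_loss = dev_loss_fn(model)      # held-out in-domain validation loss
            if dev_loss < best_dev_loss:
                best_dev_loss = dev_loss
                torch.save(model.state_dict(), "finetuned.pt")
            else:
                break                          # stop early once the model starts overfitting
        return model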

Model ensembling is another commonly used technique in this category. It essentially means combining two or more MT engines into one. Freitag and Al-Onaizan (2016) propose using an ensemble of the out-of-domain model and the fine-tuned in-domain model. The motivation behind this kind of ensembling is to prevent the degradation of out-of-domain translation after fine-tuning on in-domain data.
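
A bare-bones version of decode-time ensembling might look like the greedy-decoding sketch below, where next_token_log_probs() is an assumed per-model interface rather than a real library call; production systems do the same averaging inside beam search.

    import torch

    def ensemble_greedy_decode(models, src, bos_id, eos_id, max_len=100):
        tokens = [bos_id]
        for _ in range(max_len):
            # Each model scores the next token; the distributions are averaged.
            log_probs = [m.next_token_log_probs(src, tokens) for m in models]  # assumed method
            avg = torch.stack(log_probs).mean(dim=0)
            next_id = int(avg.argmax())
            tokens.append(next_id)
            if next_id == eos_id:
                break
        return tokens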

Constrained decoding is another evolving technique of domain adaptation for NMT. In this approach, we force in-domain terminology to appear in the final translation. We describe one of these approaches in detail in our previous post on applying terminology in Neural MT.
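
True lexically constrained decoding enforces the terms during beam search itself; the much simpler sketch below only re-ranks an n-best list by how many required in-domain terms each hypothesis contains, as a rough stand-in for the idea rather than the actual algorithm.

    def rerank_with_terminology(nbest, required_terms):
        # nbest: list of (model_score, translation) pairs from beam search.
        # required_terms: target-side terminology that should appear in the output.
        def satisfied(translation):
            return sum(term in translation for term in required_terms)
        # Prefer hypotheses that satisfy more terminology constraints,
        # breaking ties by the original model score.
        return sorted(nbest, key=lambda x: (satisfied(x[1]), x[0]), reverse=True)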

In summary

The focus in the early days of Neural MT was on general solutions, so the conversation around domain adaptation is relatively new. Nevertheless, it is a critically important process, especially for practical applications, and the most effective approaches will only really come to the fore through testing, and trial and error, directly on these use cases.
Raj Patel

Machine Translation Scientist