Issue #56 - Scalable Adaptation for Neural Machine Translation

Raj Patel 17 Oct 2019
The topic of this blog post is domain adaptation.

Introduction 

Although current research has explored numerous approaches for adapting Neural MT engines to different languages and domains, fine-tuning remains the most common one. In fine-tuning, the parameters of a pre-trained model are updated for the target language or domain in question. However, fine-tuning requires training and maintaining a separate model for each target task (i.e. a separate MT engine for every domain, client, language, etc.). Beyond the growing number of models, fine-tuning also requires very careful tuning of hyper-parameters (e.g. learning rate, regularisation) during adaptation and is prone to rapid overfitting, a sensitivity that only worsens for higher-capacity (larger) models. In this post, we discuss a simple yet efficient approach to handling multiple domains and languages, proposed by Bapna et al. (2019).

Figure 1: Scalable Adaptation for NMT

Approach

The proposed approach consists of two phases: 
  1. Training a generic base model 
  2. Adapting it to new tasks by adding small network modules 
In the first phase, a standard NMT model is trained and, following convergence, all its parameters are frozen, preserving the information learned during pre-training. Next, for every new task (language/domain), adapter layers (see Fig. 1, right pane) are injected after every layer of the encoder and decoder. Only the parameters of these task-specific adapters are fine-tuned for each new language or domain, allowing a single shared model to serve all tasks simultaneously.
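To make the adapter idea concrete, here is a minimal PyTorch sketch of what one such adapter block might look like: a layer-normalised bottleneck (down-projection, non-linearity, up-projection) wrapped in a residual connection. The class name, dimensions and exact layer placement are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: layer-norm -> down-projection -> ReLU
    -> up-projection, added back to the input via a residual connection."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual path keeps the adapter close to an identity function
        # at initialisation, so the frozen base model's behaviour is preserved.
        z = self.layer_norm(x)
        z = torch.relu(self.down(z))
        return x + self.up(z)
```

Because the bottleneck dimension can be varied per task, the capacity of each adapter can be matched to the amount of adaptation data available, which is one of the selling points of the approach.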

Performance

In their paper, Bapna et al. (2019) evaluated the performance of the proposed approach on two tasks: (i) Domain adaptation and (ii) Multilingual NMT (MNMT). 

Domain Adaptation 

Using adapters for domain adaptation, they follow a two-step approach:
  1. Pre-training: Pre-train the NMT model on a large open-domain corpus. Freeze all the parameters of this pre-trained model.
  2. Adaptation: Inject a set of domain-specific adapter layers for every target domain. Only these adapters are then fine-tuned to maximise performance on the corresponding domain. This step can be applied any time a new domain is added to the model (a minimal sketch of this freeze-and-adapt step follows the list).
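As a rough illustration, assuming a PyTorch model whose adapter modules are stored per domain, the freeze-and-adapt step might look like the sketch below. Names such as the adapters attribute and the loss interface are hypothetical placeholders, not the paper's code.

```python
import torch

def adapt_to_domain(model, domain: str, batches, lr: float = 1e-4):
    """Hypothetical training loop: freeze the pre-trained model and
    fine-tune only the adapter modules registered for `domain`."""
    # 1. Freeze every pre-trained parameter (encoder, decoder, embeddings).
    for p in model.parameters():
        p.requires_grad = False

    # 2. Unfreeze only the adapters belonging to this domain.
    domain_params = list(model.adapters[domain].parameters())
    for p in domain_params:
        p.requires_grad = True

    optimiser = torch.optim.Adam(domain_params, lr=lr)
    for src, tgt in batches:
        loss = model(src, tgt, domain=domain)  # assumed to return the NMT loss
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```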
To compare the adaptation performance of the lightweight adapters with full fine-tuning, they experimented with a large-scale English-French engine. The base model is trained on the WMT corpus (36M segments) and then adapted to the IWSLT’15 and JRC-Acquis datasets. For both domains, the quality scores are comparable to full fine-tuning, and the proposed method is even slightly better for the IWSLT domain. 

Multilingual NMT 

In MNMT, the proposed adapters are used to improve performance on the languages learned during pre-training; in contrast to domain adaptation, new languages cannot be added during the adaptation step. The following two-step approach is used for MNMT:
  1. Global training: Train a fully shared model on all language pairs, with the goal of maximising transfer to low-resource languages.
  2. Refinement: Fine-tune language-pair-specific adapters for all high-resource languages, to recover the performance lost during step 1. This step can only be applied to language pairs seen during global training (a minimal routing sketch follows the list).
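A minimal sketch of how such per-language-pair routing might look, reusing the Adapter class sketched in the Approach section, is shown below; the AdapterBank name and the fallback behaviour for unrefined pairs are assumptions for illustration.

```python
import torch.nn as nn

class AdapterBank(nn.Module):
    """Hypothetical container that routes hidden states through the adapter
    registered for the current language pair (e.g. "en-fr")."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int, lang_pairs):
        super().__init__()
        # Adapter is the bottleneck module sketched in the Approach section.
        self.adapters = nn.ModuleDict(
            {pair: Adapter(hidden_dim, bottleneck_dim) for pair in lang_pairs}
        )

    def forward(self, x, lang_pair: str):
        if lang_pair in self.adapters:
            return self.adapters[lang_pair](x)
        # Pairs without a refined adapter fall back to the shared
        # representation produced during global training.
        return x
```

In the paper's setup, only high-resource pairs need such refinement; low-resource pairs already benefit most from the fully shared model.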
They experiment with a large-scale MNMT model (102 languages to and from English), trained on the multilingual corpus developed by Arivazhagan et al. (2019), which contains a total of 25 billion sentence pairs, and compare it with bilingual baselines. The MNMT model significantly outperforms the bilingual models for low-resource languages; however, significant performance deterioration is observed for high-resource languages. With adapter-based refinement, performance on high- and medium-resource languages improves by a large margin, while performance on the low-resource languages is maintained.

In summary

The proposed approach is really impressive as it allows adaptation to a new domain at any point in time without affecting the existing system. On the quality front, it is comparable to full fine-tuning or the bilingual baseline, and even better in some cases. This could have big implications in terms of reducing maintenance overheads for organisations handling a large variety of content types and languages, i.e. most enterprises!
Raj Patel
Machine Translation Scientist