Issue #62 - Domain Differential Adaptation for Neural MT
28 Nov 2019
Introduction
Neural MT models are data hungry and domain sensitive, and it is nearly impossible to obtain a large amount (>1M segments) of training data for every domain we are interested in. One common strategy is to align the statistics of the source and target domains, but the drawback of this approach is that the statistics of different domains are inherently divergent, and smoothing over these differences does not always yield optimal performance. In this post we discuss Domain Differential Adaptation (DDA), proposed by Dou et al. (2019), where instead of smoothing over the differences we embrace them.
Domain Differential Adaptation
In the DDA method, the domain difference is captured by two Language Models (LMs), trained on in-domain (LM-in) and out-of-domain (LM-out) monolingual data respectively. We then adapt the NMT model trained on out-of-domain data (NMT-out) so that it approximates the NMT model that would be trained on in-domain parallel data (NMT-in), without using any in-domain parallel data. In the paper, the authors propose two approaches under the overall umbrella of the DDA framework (see the sketches after this list):
- Shallow Adaptation (DDA-Shallow): Given LM-in, LM-out, and NMT-out, the three models are combined at decoding time. Specifically, at each decoding step t, the probability of the next generated word y_t is obtained by interpolating the log-probabilities of LM-in and LM-out into NMT-out. Intuitively, this encourages the model to generate more words from the target domain while reducing the probability of words specific to the source domain.
- Deep Adaptation (DDA-Deep): This method lets the model learn to predict the next word from the hidden states of LM-in, LM-out, and NMT-out. The parameters of the LMs are frozen, and only the fusion strategy and the NMT parameters are trained.
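To make the decoding-time combination in DDA-Shallow concrete, here is a minimal sketch of a single greedy decoding step. The exact interpolation scheme and weights used by Dou et al. (2019) may differ; the function name `dda_shallow_step`, the weight `lam`, and the toy vocabulary are illustrative assumptions.

```python
import numpy as np

def dda_shallow_step(logp_nmt_out, logp_lm_in, logp_lm_out, lam=0.5):
    """One greedy decoding step of a DDA-Shallow style combination (sketch).

    logp_nmt_out : log-probs over the vocab from NMT-out at step t
    logp_lm_in   : log-probs over the vocab from the in-domain LM
    logp_lm_out  : log-probs over the vocab from the out-of-domain LM
    lam          : interpolation weight (hypothetical hyperparameter)

    The in-domain LM boosts target-domain words while the out-of-domain
    LM discounts source-domain words.
    """
    score = logp_nmt_out + lam * (logp_lm_in - logp_lm_out)
    return int(np.argmax(score))

# Toy usage over a 5-word vocabulary
vocab = ["the", "court", "ruled", "patient", "</s>"]
logp_nmt = np.log([0.30, 0.05, 0.10, 0.25, 0.30])
logp_in  = np.log([0.20, 0.05, 0.05, 0.50, 0.20])   # medical LM favours "patient"
logp_out = np.log([0.20, 0.40, 0.20, 0.05, 0.15])   # law LM favours "court"
print(vocab[dda_shallow_step(logp_nmt, logp_in, logp_out)])  # -> "patient"
```

For DDA-Deep, the sketch below shows what a gated fusion of hidden states could look like, in the spirit of deep fusion (Gulcehre et al., 2015). The actual fusion parameterization in the paper may differ; the point is that the LM hidden states enter the prediction through learned gates while the LM weights themselves stay frozen.

```python
import torch
import torch.nn as nn

class DeepFusion(nn.Module):
    """Sketch of a DDA-Deep style fusion layer (details assumed)."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        # Gates deciding how much of each LM hidden state to let through
        self.gate_in = nn.Linear(2 * hidden_size, hidden_size)
        self.gate_out = nn.Linear(2 * hidden_size, hidden_size)
        # Output projection over the fused representation
        self.proj = nn.Linear(3 * hidden_size, vocab_size)

    def forward(self, h_nmt, h_lm_in, h_lm_out):
        # h_lm_in / h_lm_out come from frozen LMs (requires_grad=False upstream)
        g_in = torch.sigmoid(self.gate_in(torch.cat([h_nmt, h_lm_in], dim=-1)))
        g_out = torch.sigmoid(self.gate_out(torch.cat([h_nmt, h_lm_out], dim=-1)))
        fused = torch.cat([h_nmt, g_in * h_lm_in, g_out * h_lm_out], dim=-1)
        return torch.log_softmax(self.proj(fused), dim=-1)
```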
Results
The performance of the proposed methods is evaluated on German-English (de-en) and Czech-English (cs-en) data covering the law, medical and IT domains. In the experiments, the authors also compare their methods against three other adaptation strategies:
- Shallow fusion and deep fusion (Gulcehre et al., 2015)
- Copied monolingual data model (Currey et al., 2017)
- Back-translation (Sennrich et al., 2016)