Issue #37 - Zero-shot Neural MT as Domain Adaptation
Introduction
Zero-shot machine translation - a topic we first covered in Issue #6 - builds on the idea that a single MT engine can translate between multiple languages. Such multilingual Neural MT systems can be built by simply concatenating parallel sentence pairs in several language directions and adding a token on the source side that indicates the language to translate into. The system learns how to encode input in several different languages into a vector representation, and how to generate the output conditioned on that encoder representation. This configuration enables zero-shot translation, that is, translation in a language direction not seen in training.
However, so far zero-shot translation has been much worse than translation via a pivot language. In this post, we take a look at a paper which analyses why zero-shot translation is not working and proposes effective solutions by considering a new source language as a new domain.
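The data preparation described in this introduction can be sketched in a few lines of Python. This is only an illustration: the `<2xx>` token format, the function names and the toy sentence pairs are assumptions made for the example, not details taken from the paper.

```python
def tag_source(source: str, target_lang: str) -> str:
    """Prepend a target-language token (assumed format: <2fr>, <2de>, ...)."""
    return f"<2{target_lang}> {source}"


def build_multilingual_corpus(corpora):
    """Concatenate parallel corpora from several directions into one training set.

    `corpora` maps (source_lang, target_lang) to a list of (source, target) pairs.
    """
    examples = []
    for (_, target_lang), pairs in corpora.items():
        for source, target in pairs:
            examples.append((tag_source(source, target_lang), target))
    return examples


# Toy English-centric data: no French-German pair is ever seen in training.
corpora = {
    ("en", "fr"): [("Good morning.", "Bonjour.")],
    ("fr", "en"): [("Bonjour.", "Good morning.")],
    ("en", "de"): [("Good morning.", "Guten Morgen.")],
    ("de", "en"): [("Guten Morgen.", "Good morning.")],
}
training_set = build_multilingual_corpus(corpora)

# A zero-shot request: French input, German requested as the output language.
zero_shot_input = tag_source("Bonjour.", "de")
```

At inference time, the zero-shot case is simply a combination of source language and target token that never occurred together in the training set.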
The missing ingredient
Arivazhagan et al. (2019) build a multilingual neural MT system that translates from English into French and German and back, trained on English-French and English-German parallel data. They then use this system to translate from French into German and from German into French, although no training data was available for this language pair. The idea behind this is that the multilingual engine learns how to encode French, German or English input into a vector representation, and learns how to decode this representation to generate text in the target language. Thus it can, in theory, decode into a target language that was only seen in training together with a different source language. However, this doesn’t work very well, because the decoder learns to generate the target text conditioned on the encoder representation, and that representation differs from one source language to another. For zero-shot translation to work well, the encoder representation should therefore be language-independent. In other words, it should be like an interlingua.
Aligning Representations
To make the encoder representations more similar to each other, and thus more language-independent, Arivazhagan et al. use domain adaptation techniques. They treat different source languages as different domains, and different target languages as different tasks. Taking English as the source domain, the aim is to adapt the other domains (languages) to English. To this end, they introduce a regularisation term, optimised during training, which minimises the discrepancy between the feature distributions of the source and target domains. This forces the model to make the representations of sentences in all non-English languages similar to those of their English counterparts.
In the paper, two regularisers are tested. The first one minimises the discrepancy between the feature distributions of the source and target domains by explicitly optimising an adversarial loss (in Issue #11 we had a look at adversarial training). It thus aims at aligning distributions and does not require parallel training data. The second regulariser benefits from the available parallel training data. It maximises the similarity between the encoder representations of the two sides of a segment pair, that is, the English sentence and its counterpart in the other language.
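To make the second, similarity-based regulariser more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the mean pooling over tokens, the cosine distance, the `weight` hyperparameter and all function names are assumptions chosen for illustration.

```python
import numpy as np


def mean_pool(states, mask):
    """Average encoder states over real (non-padding) tokens.

    states: (batch, time, dim); mask: (batch, time), 1 for real tokens.
    """
    mask = mask[..., None].astype(states.dtype)
    return (states * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)


def alignment_loss(en_states, en_mask, other_states, other_mask, eps=1e-8):
    """Mean cosine distance between the pooled vectors of both sides of each pair."""
    a = mean_pool(en_states, en_mask)
    b = mean_pool(other_states, other_mask)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cosine = (a * b).sum(axis=-1) / (norms + eps)
    return float(np.mean(1.0 - cosine))


def total_loss(translation_loss, alignment, weight=1.0):
    """Translation objective plus the weighted alignment regulariser."""
    return translation_loss + weight * alignment


# Random stand-ins for encoder outputs of a batch of two sentence pairs.
rng = np.random.default_rng(0)
en_states = rng.normal(size=(2, 5, 4))
fr_states = rng.normal(size=(2, 6, 4))
en_mask = np.ones((2, 5))
fr_mask = np.ones((2, 6))

penalty = alignment_loss(en_states, en_mask, fr_states, fr_mask)
loss = total_loss(translation_loss=3.2, alignment=penalty, weight=0.5)
```

The penalty is close to zero when both sides of a pair are encoded to the same direction in vector space, so minimising it alongside the translation loss pushes the encoder towards language-independent representations.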
Both regularisers yield a large improvement in the BLEU score of zero-shot translation, bringing it up to the level of pivot translation. The BLEU score increases from 17 to 26 for French to German, and from 12 to 20 for German to French.
