Issue #40 - Consistency by Agreement in Zero-shot Neural MT

Raj Patel
06 Jun 2019

Introduction

In two of our earlier posts (Issues #6 and #37), we discussed the zero-shot approach to Neural MT - learning to translate from source to target without seeing even a single example of the language pair directly. In Neural MT, zero-shot translation is achieved using a multilingual architecture (Johnson et al., 2017) - a single NMT engine that can translate between multiple languages. The multilingual neural model is trained for several language directions at once by concatenating the parallel sentences of the various language pairs.
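
As a minimal sketch of how such a concatenated training set might be assembled, the Python snippet below prefixes each source sentence with a token naming the desired target language, the convention described by Johnson et al. (2017). The exact token format (`<2es>`), the function name and the toy sentences are illustrative assumptions, not taken from any particular toolkit.

```python
from typing import Dict, List, Tuple

# (source_lang, target_lang) -> list of (source sentence, target sentence)
Corpus = Dict[Tuple[str, str], List[Tuple[str, str]]]


def build_multilingual_corpus(corpora: Corpus) -> List[Tuple[str, str]]:
    """Concatenate all language pairs into one training set.

    Each source sentence is prefixed with a token naming the desired
    target language, so a single model can serve every direction.
    """
    examples = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        tag = f"<2{tgt_lang}>"  # hypothetical token format
        for src, tgt in pairs:
            examples.append((f"{tag} {src}", tgt))
    return examples


toy_corpora = {
    ("en", "es"): [("Hello", "Hola")],
    ("es", "en"): [("Hola", "Hello")],
    ("en", "fr"): [("Hello", "Bonjour")],
}
training_set = build_multilingual_corpus(toy_corpora)
```

With this setup, a zero-shot request such as Es-Fr is made at inference time by tagging a Spanish sentence with `<2fr>`, even though no Es-Fr pairs appeared in training.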

In this post, we focus on the generalisation issue of zero-shot neural MT and discuss a new training method proposed by Al-Shedivat and Parikh (2019).

Zero-shot consistency

A neural MT engine is said to be ‘zero-shot consistent’ if low error on supervised tasks implies low error on zero-shot tasks, i.e. the system generalises. In general, it is better to have a translation system that exhibits zero-shot generalisation, as access to parallel data is always limited and training is computationally expensive.

To achieve zero-shot consistency in Neural MT, Al-Shedivat and Parikh proposed a new training objective for multilingual NMT called ‘agreement-based likelihood’ that avoids the limitations of pure composite likelihoods. The idea of agreement-based learning was initially proposed for learning consistent alignments (Liang et al., 2006) in phrase-based statistical machine translation (SMT).

Agreement-based likelihood

Rather than jumping into the full details of the objective function, let’s consider, for simplicity, a multilingual NMT model of 4 languages -- English (En), Spanish (Es), French (Fr), and Russian (Ru) -- where parallel corpora are available for En-Es, En-Fr, and En-Ru. Intuitively, the objective is the likelihood of observing parallel sentences (X_En, X_Fr) and having the sub-models P_θ(Z | X_En) and P_θ(Z | X_Fr) agree on all translations into Spanish and Russian at the same time, where Z = {Es, Ru}.
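
To make the idea of "agreeing" concrete, here is a toy sketch in Python. It reads the agreement term as log Σ_z P(z | X_En) · P(z | X_Fr), which is one plausible formalisation of the intuition above and should be treated as an assumption. It also assumes each sub-model scores a small, enumerable set of candidate Spanish translations; real NMT models define distributions over a vast space of sequences, so in practice such a term has to be approximated. The function name and the probabilities are invented for illustration.

```python
import math
from typing import Dict


def agreement_log_likelihood(p_from_en: Dict[str, float],
                             p_from_fr: Dict[str, float]) -> float:
    """log sum_z P(z | x_En) * P(z | x_Fr) over shared candidates z.

    The value is highest (closest to 0) when both sub-models put their
    probability mass on the same translation.
    """
    overlap = sum(p * p_from_fr.get(z, 0.0) for z, p in p_from_en.items())
    return math.log(overlap) if overlap > 0 else float("-inf")


# Toy distributions over candidate Spanish translations of one sentence pair.
agree = agreement_log_likelihood(
    {"hola mundo": 0.9, "hola tierra": 0.1},
    {"hola mundo": 0.8, "hola tierra": 0.2},
)
disagree = agreement_log_likelihood(
    {"hola mundo": 0.9, "hola tierra": 0.1},
    {"hola mundo": 0.1, "hola tierra": 0.9},
)
```

Here `agree` is larger than `disagree`, so maximising this term pushes the En-to-Es and Fr-to-Es sub-models towards producing the same Spanish translation for a parallel En-Fr sentence pair.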

Does it work?

Al-Shedivat and Parikh experimented with the UN corpus for En, Es, Fr, and Ru; Europarl v7 for German (De), En, Es, and Fr; and IWSLT17 for Italian (It), Dutch (Nl), Romanian (Ro), De, and En. They focus their evaluation mainly on the zero-shot performance of the following methods:
  • Basic, which stands for directly evaluating a multilingual model after standard training (Johnson et al., 2017).
  • Pivot, which performs pivoting-based inference using a multilingual model (after standard training); often regarded as the gold standard.
  • Agree, which applies a multilingual model trained with the agreement objective directly to zero-shot directions.
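
The difference between direct inference (Basic or Agree) and Pivot inference can be sketched as follows. The `translate` function is a hypothetical stand-in for a multilingual model, backed by a toy lookup table so that the example runs; it is not a real library API.

```python
from typing import Dict, Tuple

# Toy stand-in for a trained multilingual model: (text, target_lang) -> output.
TOY_MODEL: Dict[Tuple[str, str], str] = {
    ("hola", "en"): "hello",
    ("hello", "fr"): "bonjour",
    ("hola", "fr"): "bonjour",
}


def translate(text: str, tgt_lang: str) -> str:
    """Hypothetical single call to a multilingual NMT model."""
    return TOY_MODEL[(text, tgt_lang)]


def direct_inference(text: str, tgt_lang: str) -> str:
    """Basic / Agree: one decoding pass straight into the target language."""
    return translate(text, tgt_lang)


def pivot_inference(text: str, tgt_lang: str, pivot_lang: str = "en") -> str:
    """Pivot: translate into the pivot language first, then into the target."""
    intermediate = translate(text, pivot_lang)
    return translate(intermediate, tgt_lang)


direct_output = direct_inference("hola", "fr")
pivot_output = pivot_inference("hola", "fr")
```

Basic and Agree share the same single-pass inference path and differ only in how the model was trained, whereas Pivot needs two decoding passes per sentence.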
In the experiments with the UN corpus, they reported that the models trained with agreement perform comparably to Pivot, and outperform it in some cases, e.g. when the target is Russian, perhaps because it is quite different linguistically from the English pivot. On the Europarl dataset, the models trained with the proposed objective consistently outperform the Basic models by 2-3 BLEU points but lag behind the Pivot-based systems. On IWSLT17, because of the large amount of data overlap and the presence of many supervised translation pairs, the vanilla training method (Johnson et al., 2017) achieves very high zero-shot performance, even outperforming Pivot. The models trained with agreement give a slight gain over these strong baseline systems.

In summary

Training for zero-shot generalisation in multilingual neural MT not only improves zero-shot translation but also preserves the quality of supervised translation. Pivot-based systems are still better on some datasets (e.g. Europarl), but pivoting involves translating twice (source-to-pivot and pivot-to-target).