Issue #117 - Subword Segmentation and a Single Bridge Language Affect Zero-Shot Neural Machine Translation
Introduction
Zero-shot machine translation is receiving increasing attention because of the high cost of building a new engine for every language direction. The underlying principle is to build a single model that learns to translate between language pairs it has never been directly trained on. Following the first zero-shot approach proposed by Firat et al. (2016), several recent studies (Johnson et al., 2017; Gu et al., 2019; Arivazhagan et al., 2019; Zhang et al., 2020) have also improved translation quality for zero-shot language pairs.
This research inspired us to ask a question: does this strategy always deliver promising and stable performance? In some of our earlier posts (#6, #37 and #40), we discussed the zero-shot neural machine translation (NMT) approach and its application for domain adaptation. In this post, we take a look at a paper by Rios et al. (2020) which evaluates the behavior of the multilingual model proposed by Johnson et al. (2017) for zero-shot language directions under different preprocessing and data settings.
Does the zero-shot approach always deliver stable performance?
To evaluate the stability of this strategy, the authors first build an English-centric baseline using 5 million parallel sentences from WMT (Barrault et al., 2019) per language direction (English <> {French (fr), Czech (cs), German (de), Finnish (fi)}). For the zero-shot language pairs (de-fi, de-fr and cs-fr), they sample three test sets from OPUS (Tiedemann, 2012), each with 2,000 parallel segments. The data is prepared following Johnson et al. (2017): a tag is prepended to the source side to specify the target language, and a byte-pair encoding (BPE) model is trained jointly on the mixed data in the five languages. All systems are base Transformers. Across three training runs, the BLEU scores on the zero-shot language pairs are quite unstable: they vary by up to 6.28 points, with a standard deviation of 3.14.
Impact of vocabulary overlap from different languages
One of the hypotheses in this paper is that the model's behaviour relies heavily on the subword vocabulary shared between languages. To test this idea, two models are trained:
(a) a model trained with language-specific subword segmentation and no vocabulary overlap, achieved by adding a language tag to each subword (e.g. the English subword "in" becomes "in#en#");
(b) a model trained in the same way as (a) but with vocabulary overlap allowed, so no language tags are needed.
According to the results, model (a) achieves only 4.7 BLEU, because many English subwords marked with #en# appear in the translations. When the language tags are ignored, the score rises to 12.7, which is still worse than the baseline trained with joint BPE (15.4 BLEU). Model (b), however, reaches 20.5 BLEU, which is even better than the baseline.
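The tagging scheme used for model (a) can be illustrated with a minimal sketch. The function names, the toy sentences and the exact tag pattern are assumptions made for illustration, modelled only on the "in#en#" example above; this is not the authors' preprocessing code.

```python
import re

# Hypothetical tag format modelled on the "in#en#" example in the text:
# a two-letter language code between '#' characters at the end of a subword.
TAG_PATTERN = re.compile(r"#[a-z]{2}#$")


def tag_subwords(subwords, lang):
    """Append a language tag to every subword so vocabularies cannot overlap."""
    return [f"{sw}#{lang}#" for sw in subwords]


def strip_tags(subwords):
    """Remove the trailing language tag, e.g. to score output ignoring tags."""
    return [TAG_PATTERN.sub("", sw) for sw in subwords]


def shared_vocabulary(corpus_a, corpus_b):
    """Return the subword types that occur in both segmented corpora."""
    vocab_a = {sw for sentence in corpus_a for sw in sentence}
    vocab_b = {sw for sentence in corpus_b for sw in sentence}
    return vocab_a & vocab_b


# Toy, already-segmented sentences (invented for illustration).
english = [["in", "the", "house"]]
german = [["in", "dem", "haus"]]

# Without tags, identical subwords are shared across languages.
print(shared_vocabulary(english, german))

# With tags, the two vocabularies become disjoint.
tagged_en = [tag_subwords(s, "en") for s in english]
tagged_de = [tag_subwords(s, "de") for s in german]
print(shared_vocabulary(tagged_en, tagged_de))

# Stripping the tags recovers the original subwords.
print(strip_tags(tagged_en[0]))
```

In this toy case the only shared subword is "in", and tagging removes that overlap entirely, which is the condition model (a) is designed to test.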
Multi-Bridge Models
One of the big challenges for zero-shot machine translation is output in the wrong language. This is typical for English-centric models: for every other language, English is the only target language it is directly paired with during training. To alleviate this issue, the authors propose adding a small amount of parallel data in language pairs that do not involve English, so that the model becomes more sensitive to the language tags. To test this hypothesis, around 350K parallel sentences per pair for German-Czech and Finnish-French, taken from Rapid2016, NewsCommentary and GlobalVoices, are added to the training data of the baseline model. The results show that even a small amount of non-English parallel data increases generalisation to unseen translation directions (+3.1 BLEU on average).
Improve zero-shot language pairs with Back-Translation and Encoder Alignment
Besides the zero-shot experiments under different data conditions, the paper also studies combining back-translation and encoder alignment with the zero-shot approach. For the back-translation experiment, the authors create synthetic corpora for all zero-shot directions using back-translation (250K parallel sentences per direction) and fine-tune the models on the concatenation of this data and 250K sentence pairs per supervised translation direction. For the encoder alignment experiment, they implement a cosine loss following Gouws et al. (2015), averaging the encoder states rather than using max pooling to normalise for sequence length. A new parameter λ scales the cosine distance (CD) loss relative to the standard cross-entropy (CE):
L = (1 - λ) * CE + λ * CD
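The combined loss can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: cosine distance is taken to be 1 minus cosine similarity, the alignment is assumed to compare the mean-pooled encoder states of a source sentence and its translation, and the names mean_pool, cosine_distance, combined_loss and lam are hypothetical.

```python
import math


def mean_pool(states):
    """Average per-token encoder state vectors into one fixed-size vector."""
    dim = len(states[0])
    return [sum(vec[i] for vec in states) / len(states) for i in range(dim)]


def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity (vectors must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def combined_loss(cross_entropy, src_states, tgt_states, lam):
    """Interpolate cross-entropy and cosine distance: (1 - lam)*CE + lam*CD."""
    cd = cosine_distance(mean_pool(src_states), mean_pool(tgt_states))
    return (1.0 - lam) * cross_entropy + lam * cd


# Toy encoder states (invented): two source tokens, one target token.
src = [[1.0, 0.0], [1.0, 0.0]]
tgt = [[0.0, 1.0]]

# Orthogonal pooled vectors give a cosine distance of 1.
print(combined_loss(cross_entropy=2.0, src_states=src, tgt_states=tgt, lam=0.25))
```

Mean pooling lets sentences of different lengths be compared directly, since each is reduced to a single vector before the distance is computed. Setting the weight to 0 recovers the plain cross-entropy objective.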
According to the results, both back-translation and encoder alignment improve performance on zero-shot language directions. Back-translation gives an average improvement of 0.7 BLEU with single-bridge data and 1.2 BLEU with multi-bridge data. Encoder alignment gives an increase of 0.8 BLEU with single-bridge data but only 0.3 BLEU in the multi-bridge scenario.