Issue #44 - Tagged Back-Translation for Neural Machine Translation
Note from the editor: You may have noticed that our posts have been a little more technical in nature over the last few weeks. This reflects a broader trend in R&D whereby higher level topics and "breakthroughs" have been covered, and scientists are now drilling down to optimise existing approaches. This can be seen again in today's post on the topic of synthetic data. However, over the coming weeks and months, we will also be looking at Neural MT from a few different angles, with some guest posts, use cases, and, of course, a preview and review of the upcoming "Machine Translation Summit" which is taking place literally 5 minutes from Iconic HQ in Dublin, Ireland next month. Stay tuned!
In this week’s post we take a look at some new research which changes the outlook for back-translation in Neural MT. Back-translation (BT) consists of adding synthetic training data produced by translating monolingual data in the target language into the source language. Because only the source side is machine-generated, no noise is introduced in the target language.
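To make the data flow concrete, here is a minimal sketch of how back-translated pairs are assembled. The function `back_translate` and the placeholder `toy_reverse_model` are hypothetical names introduced for illustration; a real setup would call a trained target-to-source NMT engine instead.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_monolingual: List[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-language text."""
    pairs = []
    for target_sentence in target_monolingual:
        # The source side is machine-generated; the target side is left untouched.
        synthetic_source = translate_to_source(target_sentence)
        pairs.append((synthetic_source, target_sentence))
    return pairs


def toy_reverse_model(sentence: str) -> str:
    # Hypothetical stand-in for a real target-to-source translation engine.
    return "<src> " + sentence.lower()


# Original parallel data plus synthetic pairs form the final training set.
original = [("Hello world.", "Hallo Welt.")]
synthetic = back_translate(["Guten Morgen."], toy_reverse_model)
training_data = original + synthetic
```

Note that the genuine monolingual sentence always ends up on the target side of the pair, which is what the model learns to produce.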
In a previous post (#16), we saw that back-translated data was beneficial up to a certain point, after which adding more became harmful. We also saw that it was more beneficial if selected in a way that improves the diversity of the training data. In a recent paper, Caswell et al. (2019) argue that this diversity mainly acts as a signal that lets the model distinguish synthetic data from original data, and that the same gains can be obtained by simply adding a tag to the synthetic data.
The methods used to give more diversity to synthetic data are adding noise or sampling (translating by allowing any output instead of always the most probable one). Both methods achieve similar improvements with respect to standard back-translation (always outputting the most probable hypothesis). Caswell et al. simply add a tag at the beginning of the translated sentence and compare this technique to noised BT. They noised the translated data by removing 10% of the words, replacing another 10% by underscores, and permuting words up to 3 words away from their original position. Examples of tagged or noised translated data, or both (the original, i.e. target-language, sentence is not changed), are as follows:
[no noise]: Raise the child, love the child.
Noised BT: Raise child ___ love child, the.
Tagged BT: <BT> Raise the child, love the child.
Tagged Noised BT: <BT> Raise, the child the ___ love.
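The noising and tagging steps described above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' code: the function names, the per-word probabilities of 0.1, and the local-shuffle trick (sorting by position plus a random offset) are assumptions chosen to match the description.

```python
import random

BT_TAG = "<BT>"   # tag placed in front of the back-translated source sentence
FILLER = "___"    # token that replaces blanked-out words


def noise_sentence(sentence, rng, p_drop=0.1, p_blank=0.1, max_shift=3):
    """Drop some words, blank some others, then shuffle words locally."""
    noised = []
    for word in sentence.split():
        r = rng.random()
        if r < p_drop:
            continue                      # remove the word
        elif r < p_drop + p_blank:
            noised.append(FILLER)         # replace the word with an underscore filler
        else:
            noised.append(word)           # keep the word
    # Local shuffle: sort by index plus an offset in [0, max_shift + 1),
    # so no word ends up more than max_shift positions from where it started.
    keys = [i + rng.random() * (max_shift + 1) for i in range(len(noised))]
    order = sorted(range(len(noised)), key=keys.__getitem__)
    return " ".join(noised[i] for i in order)


def tag_sentence(sentence):
    """Mark a back-translated source sentence as synthetic."""
    return f"{BT_TAG} {sentence}"


rng = random.Random(0)  # fixed seed so runs are repeatable
source = "Raise the child, love the child."
tagged = tag_sentence(source)
tagged_noised = tag_sentence(noise_sentence(source, rng))
```

Only the back-translated source side is noised or tagged; the target sentence is passed to training unchanged.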
The experiments were performed with a state-of-the-art transformer NMT engine on the English-German task proposed at the Conference on Machine Translation (WMT). Scores are reported as averages over several test sets. The engine is trained on 5 million original parallel segments, plus 24 million back-translated parallel segments.
The engine trained only on original data achieves a BLEU score of 32. Adding standard BT data yields an improvement of 1.1 BLEU (3.4%). Noised, tagged, and tagged noised BT data achieve improvements of 4.8%, 5.1% and 4.2%, respectively, with respect to standard BT. Thus the simple addition of a tag achieves a slightly higher gain than noising the translated data.
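For readers who want to check how the percentages relate to absolute BLEU points, here is a small worked example. The helper `relative_gain` is a hypothetical name; the only figures used are the baseline of 32 and the gain of 1.1 BLEU quoted above.

```python
def relative_gain(baseline_bleu, improved_bleu):
    """Relative improvement over a baseline, in percent."""
    return 100.0 * (improved_bleu - baseline_bleu) / baseline_bleu


baseline = 32.0               # engine trained on original data only
standard_bt = baseline + 1.1  # after adding standard back-translated data

print(f"{relative_gain(baseline, standard_bt):.1f}%")  # prints 3.4%
```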
The authors also noise the original training data and check, without BT, that this does not have a dramatic impact on the BLEU score (0.7% loss with 20% of the training data noised, and 4% loss with 80% noised). In this scenario, the noise can no longer be used to distinguish BT data from original data. We would therefore expect tagged BT to outperform noised BT by a margin similar to the gain that noised BT achieves when the original data is clean. This is confirmed by the results (tagged BT yields a 4.6% improvement over noised BT in the noised original data scenario).
However, the authors did not mention the potential gains of selecting monolingual data for back-translation based on whether it contains words that are difficult to translate, as discussed in our post #16. This could be an interesting direction for future work.