Issue #158 - Improving Neural Machine Translation by Bidirectional Training
Introduction
When learning a foreign language, it may be useful for humans to see translation examples in both directions: from their native language into the foreign one and from the foreign language back into their native one. However, this type of information is not used by current neural machine translation (MT) training pipelines. Today we take a look at a paper by Ding et al. (2021) that uses bidirectional training data to initialize a unidirectional neural MT engine.
Approach
The bidirectional training corpus is obtained by swapping the source and target sentences of a parallel corpus and appending the swapped data to the original. To capture cross-lingual properties, training is first performed on this bidirectional data for a third of the total steps. Then, to ensure that the model is properly trained in the required direction, training continues for the remaining steps on the unidirectional corpus. The authors name this approach bidirectional training (BiT).
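To make the data manipulation concrete, here is a minimal sketch of how such a bidirectional corpus and the two-stage step budget could be built. The function names and the one-third split parameter are illustrative and not taken from the authors' implementation.

    # Minimal sketch of the BiT data preparation (illustrative, not the authors' code).
    # The bidirectional corpus is the original parallel data plus the same pairs
    # with source and target swapped.

    def build_bidirectional_corpus(pairs):
        """pairs: list of (source, target) sentence pairs."""
        swapped = [(tgt, src) for src, tgt in pairs]
        return pairs + swapped

    def split_training_steps(total_steps, bidirectional_fraction=1 / 3):
        """First stage on the bidirectional data, second stage on the original data."""
        bidirectional_steps = int(total_steps * bidirectional_fraction)
        return bidirectional_steps, total_steps - bidirectional_steps

    if __name__ == "__main__":
        parallel = [("ein Haus", "a house"), ("ein Buch", "a book")]
        print(build_bidirectional_corpus(parallel))
        print(split_training_steps(300000))  # e.g. 100000 bidirectional + 200000 unidirectional steps

The appeal of the approach is that nothing else changes: no model modification is needed, only the training data of the first stage and the step schedule.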
Experiments
The impact of the proposed approach on neural MT quality is assessed on five translation datasets from IWSLT and WMT shared tasks, ranging from 140,000 to 28 million segments. The experiments are performed in both directions: English<>Spanish, English<>Romanian, English<>Swedish, and English<>German (two datasets, with 4.5 and 28 million examples). Experiments are also performed on more distant language pairs: English<>Chinese (WMT17, 20 million segments) and Japanese>English (WAT17, 2 million segments). Finally, BiT is evaluated on a very low-resource setting (WMT19 English-Gujarati).
The baseline is a Transformer Big model, except for the WMT German-English engine trained on 4.5 million segments and the English-Romanian engines, which use Transformer Base.
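For reference, "Transformer Base" and "Transformer Big" usually denote the standard configurations of Vaswani et al. (2017); assuming the paper follows them (the exact hyperparameters are not given in this summary), the two setups differ mainly in model width:

    # Standard Transformer configurations (Vaswani et al., 2017); assumed here,
    # not confirmed by the paper summary above.
    TRANSFORMER_CONFIGS = {
        "base": {"encoder_layers": 6, "decoder_layers": 6, "d_model": 512,
                 "ffn_dim": 2048, "attention_heads": 8, "dropout": 0.1},
        "big":  {"encoder_layers": 6, "decoder_layers": 6, "d_model": 1024,
                 "ffn_dim": 4096, "attention_heads": 16,
                 "dropout": 0.3},  # dropout as used for WMT14 En-De in the original paper
    }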
Results
Pre-training on the bidirectional corpus for the first third of the total steps yields an improvement in BLEU score on all datasets and in all translation directions, of 1.1 points on average. The improvement is statistically significant with a p-value < 0.001 in 7 cases and a p-value < 0.005 in the other 3. Unfortunately, the evaluation is based on BLEU only.
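The summary does not say which significance test is used, but p-values for BLEU differences are commonly obtained with paired bootstrap resampling (Koehn, 2004). The sketch below, based on sacrebleu, only illustrates that generic procedure and is not the authors' exact protocol.

    import random
    import sacrebleu

    def paired_bootstrap_pvalue(sys_a, sys_b, refs, n_samples=1000, seed=12345):
        """Approximate p-value for 'system A beats system B in corpus BLEU'
        using paired bootstrap resampling (Koehn, 2004).
        sys_a, sys_b, refs: lists of sentence strings of equal length."""
        rng = random.Random(seed)
        not_better = 0
        for _ in range(n_samples):
            idx = rng.choices(range(len(refs)), k=len(refs))  # resample sentences with replacement
            bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in idx], [[refs[i] for i in idx]]).score
            bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in idx], [[refs[i] for i in idx]]).score
            if bleu_a <= bleu_b:
                not_better += 1
        return not_better / n_samples  # small value => A is significantly better than B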
Interestingly, the same bidirectional pre-training can be used for both translation directions. For example, the model pretrained on the bidirectional corpus for the German → English direction can also be tuned for the reverse direction (English → German).
Results on more distant language pairs, such as Chinese-English or Japanese-English, also show a BLEU score improvement in all cases. The average improvement is 0.9 BLEU.
Positive results are also obtained in the low-resource setting, with the English-Gujarati language pair. While back-translation alone does not improve over the baseline in this setting, BiT does. Furthermore, back-translation applied on top of the baseline+BiT model brings further improvement.
The bidirectional training approach is also combined with other popular data manipulation techniques, namely back-translation, knowledge distillation and data diversification (which applies back-translation and knowledge distillation to the parallel corpus). Combined with these techniques, bidirectional training still yields BLEU score improvements.
Finally, the authors claim that bidirectional training also improves word alignment, which is extracted from the attention layers. Thanks to bidirectional training, the precision is improved by 3.4% and the recall by 2.2%. The intuition behind this is that bidirectional training encourages self-attention to learn bilingual agreement.
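As an illustration of the alignment extraction idea (a generic recipe, not necessarily the paper's exact procedure): each target token can be aligned to the source token receiving the highest attention weight, and the resulting links are scored against gold alignments.

    import numpy as np

    def alignments_from_attention(attn):
        """attn: (target_len, source_len) matrix of attention weights.
        Links each target position to its most-attended source position."""
        return {(t, int(np.argmax(row))) for t, row in enumerate(attn)}

    def precision_recall(predicted, gold):
        """Precision and recall of predicted alignment links against gold links."""
        correct = len(predicted & gold)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        return precision, recall

    if __name__ == "__main__":
        # Toy 3-target x 4-source attention matrix (rows sum to 1).
        attn = np.array([[0.7, 0.1, 0.1, 0.1],
                         [0.1, 0.2, 0.6, 0.1],
                         [0.1, 0.1, 0.2, 0.6]])
        predicted = alignments_from_attention(attn)
        gold = {(0, 0), (1, 2), (2, 3)}
        print(predicted, precision_recall(predicted, gold))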
In summary
Ding et al. (2021) propose a very simple approach in which a neural MT model is pretrained on a bidirectional corpus before training continues on the unidirectional corpus. The bidirectional corpus is simply the original parallel data to which the same sentence pairs, with source and target swapped, are appended. This approach achieves consistent and significant BLEU score improvements across a variety of language pairs and settings, including low- and high-resource pairs and combinations with back-translation and/or knowledge distillation. Furthermore, word alignment is also improved.