Issue #138 - Data Diversification - A Simple Strategy For NMT

Akshai Ramesh
08 Jul 2021


In recent years, Neural Machine Translation (NMT) models have achieved great success and become the mainstream method in practical MT systems. Even so, NMT is far from perfect, and a wide range of research aims to identify and address its shortcomings. Today we focus on one such research area: the generalisation of NMT models. In Issue #22 of the blog series, we covered a model-centric approach to incorporating diversity in the model. In today’s blog post, we take a look at the work of Nguyen et al. (2020), who propose a novel data-centric approach that diversifies the training corpus without using monolingual data and improves machine translation consistently and significantly.


In order to make the NMT models more powerful and efficient, there has been a lot of research in two directions:
  1. The model-centric approach, which focuses mainly on modifying the model architecture. It includes the move from traditional recurrent models (Sutskever et al., 2014) to the attention-based Transformer (Vaswani et al., 2017).
  2. The data-centric approach, which leaves the model architecture unchanged and relies instead on synthesized data. Back-translation (Sennrich et al., 2016a) and the copied-corpus method (Currey et al., 2017) are examples of data-centric approaches.
The advantage of the second approach is that it is not model-dependent and can be applied/adapted to future architectural developments with small modifications.
The Data Diversification strategy falls under the second category and is inspired by multiple well-known strategies: back-translation, ensemble of models, data augmentation, and knowledge distillation for NMT.

The Approach 

For a given parallel corpus D = (S, T), where S and T are the source-side and target-side corpora, the strategy runs N iterations, each consisting of two stages: training and translation. In the training stage of iteration i, k models are trained in the forward direction (M^1_{S→T,i}, ..., M^k_{S→T,i}) and k models in the backward direction (M^1_{T→S,i}, ..., M^k_{T→S,i}).
In the translation stage of each iteration, the forward models (M^1_{S→T,i}, ..., M^k_{S→T,i}) translate the original source-side corpus S to generate synthetic target-side corpora, and the backward models (M^1_{T→S,i}, ..., M^k_{T→S,i}) translate the original target-side corpus T to generate synthetic source-side corpora.
At the end of each iteration, all the generated synthetic data is concatenated with the original parallel corpus, and the resulting corpus D_i is passed as input to the next iteration. After N iterations, the final model M_{S→T} is trained on the augmented corpus D_N.
The strategy can be represented in the form of an algorithm, as shown below:
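The iterations described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `diversify`, `toy_train` and the `seed` argument are hypothetical names, the toy "model" is a lookup table standing in for a real NMT system, and the sketch assumes that each round's corpus D_i keeps the data accumulated in earlier rounds.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                      # (source sentence, target sentence)
Translator = Callable[[str], str]           # a trained model, used as a function
TrainFn = Callable[[List[Pair], int], Translator]


def diversify(source: List[str], target: List[str],
              train: TrainFn, k: int, n_rounds: int) -> List[Pair]:
    """Return the augmented corpus D_N built from D = (source, target)."""
    corpus: List[Pair] = list(zip(source, target))       # D_0 = D
    for r in range(1, n_rounds + 1):
        synthetic: List[Pair] = []
        for i in range(k):
            # Assumption: the k models of a round differ only by their seed.
            seed = 1000 * r + i
            forward = train(corpus, seed)                          # S -> T
            backward = train([(t, s) for s, t in corpus], seed)    # T -> S
            # Forward models translate the ORIGINAL source side S.
            synthetic += [(s, forward(s)) for s in source]
            # Backward models translate the ORIGINAL target side T.
            synthetic += [(backward(t), t) for t in target]
        corpus = corpus + synthetic                       # D_r
    return corpus


def toy_train(pairs: List[Pair], seed: int) -> Translator:
    """Hypothetical stand-in for NMT training: memorise pairs, tag outputs."""
    table = dict(pairs)

    def translate(sentence: str) -> str:
        return f"{table.get(sentence, sentence)} <m{seed}>"

    return translate


source = ["guten morgen", "danke"]
target = ["good morning", "thank you"]
augmented = diversify(source, target, toy_train, k=3, n_rounds=1)
final_model = toy_train(augmented, 0)   # final M_{S→T}, trained on D_N
```

Under these assumptions, one round with k = 3 turns the 2 original pairs into 14: each of the 3 forward and 3 backward models contributes one synthetic copy of the corpus, so each round adds 2k times the original size.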

Experiments and Results

Datasets: The proposed approach is evaluated on both high- and low-resource translation tasks using datasets of different sizes: WMT’14 English (En) -> German (De) and English (En) -> French (Fr), plus eight further language directions from IWSLT’14 English (En) <-> German (De), IWSLT’13 English (En) <-> French (Fr), and the low-resource setup proposed by Guzmán et al. for English (En) <-> Nepali (Ne) and English (En) <-> Sinhala (Si).
Results: The approach achieves state-of-the-art results in the WMT’14 English-German and English-French translation tasks with 30.7 and 43.7 BLEU, respectively. There are also significant improvements of 1.0-2.0 BLEU in the IWSLT’14 English <-> German and IWSLT’13 English <-> French tasks.
The model trained with the data diversification approach also outperforms the baseline in the low-resource tasks across all 4 language directions of English <-> Sinhala and English <-> Nepali.


  • The proposed approach shows a strong correlation with an ensemble of models. Unlike ensembling, however, it requires no additional computation or parameters at inference time.
  • The data diversification strategy sacrifices perplexity for better BLEU scores and yields better generalisation in the translations.
  • When the forward and backward model translations are used individually for diversification, it can be seen that forward diversification outperforms backward diversification. 
  • The proposed approach doesn’t suffer from the translationese effect, and experiments show that it is complementary to the semi-supervised back-translation approach.
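The forward-only and backward-only variants mentioned in the list above can be illustrated with a small Python sketch. It assumes that "forward diversification" means adding only the forward models' translations to the corpus, and "backward diversification" only the backward models' translations. The function name `augment` is hypothetical, and plain string functions stand in for trained translation models.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]
Translator = Callable[[str], str]


def augment(source: List[str], target: List[str],
            forward_models: Sequence[Translator] = (),
            backward_models: Sequence[Translator] = ()) -> List[Pair]:
    """One round of diversification with already-trained models.

    Passing only forward_models gives forward diversification, passing
    only backward_models gives backward diversification.
    """
    corpus: List[Pair] = list(zip(source, target))
    for model in forward_models:
        # Real source sentence paired with a synthetic target sentence.
        corpus += [(s, model(s)) for s in source]
    for model in backward_models:
        # Synthetic source sentence paired with a real target sentence.
        corpus += [(model(t), t) for t in target]
    return corpus


source = ["a b", "c d"]
target = ["x y", "z w"]
# Toy stand-ins for trained models (assumption: any str -> str callable).
forward_only = augment(source, target, forward_models=[str.upper, str.title])
backward_only = augment(source, target, backward_models=[str.upper])
```

In the forward-only corpus every source side is an original sentence, while in the backward-only corpus every target side is an original sentence. The two ablations differ only in which side of the synthetic pairs is machine-generated.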

In summary

Nguyen et al. (2020) propose a simple strategy to diversify the training corpus that doesn’t require any additional data or modification to the model architecture. The approach diversifies the training corpus with synthetic parallel data generated from the predictions of multiple forward and backward models on the original source-side and target-side corpora. It achieves state-of-the-art results in the WMT’14 English-German and English-French translation tasks with 30.7 and 43.7 BLEU, respectively. It also shows significant improvements across 8 other translation tasks covering both high- and low-resource training corpora. The experiments show that the approach outperforms the related methods of knowledge distillation and dual learning, and that it is complementary to back-translation.