Issue #59 - Improving Multilingual Neural MT for unseen Languages

Raj Patel
07 Nov 2019

Recent research has shown that adapting Multilingual Neural Machine Translation (MNMT) models to low-resource languages (LRLs) improves performance significantly. In this post we’ll discuss a few methods proposed by Lakew et al. (2019), based on data selection and dynamic adaptation, for improving an MNMT model on an unseen LRL, i.e. zero-shot translation (a topic we covered in a previous post).

Data Selection by Language Distance

Perplexity is commonly used to measure the quality of a language model, but it has also been used to measure the distance between languages. Lakew et al. (2019) used perplexity both to measure the relatedness of two languages and to select high-resource language (HRL) data that are similar to the LRL data. In the selection process, a language model is trained on the LRL data and is then used to select the training data with the lowest perplexity from the related HRL data. They refer to this method as Select-pplx and compare it with the following methods:
  • Select-one: Taking all available data only from one HRL related to the LRL.
  • Select-fam: Taking all available data from a set of HRL related to the LRL belonging to the same language family.
  • Select-rand: Randomly sampling, from the HRLs that are closely related to the LRL, the same proportion of data as used for Select-one and Select-pplx.
In their experiments, Select-pplx performed best among these methods.
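To make the Select-pplx idea more concrete, here is a minimal sketch of perplexity-based selection. It is an illustration only: the post does not say which kind of language model the authors used, so the character bigram model with add-one smoothing below is our assumption, and the function names (train_char_bigram, perplexity, select_pplx) are hypothetical.

```python
import math
from collections import Counter


def train_char_bigram(sentences):
    """Count character bigrams over the LRL sentences (assumed toy LM)."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    # One extra slot in the vocabulary size accounts for unseen characters.
    return bigrams, contexts, len(vocab) + 1


def perplexity(sentence, model):
    """Per-character perplexity of a sentence under the add-one smoothed model."""
    bigrams, contexts, vocab_size = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    log_prob = 0.0
    for a, b in zip(chars, chars[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


def select_pplx(lrl_sentences, hrl_sentences, k):
    """Keep the k HRL sentences that the LRL language model finds least surprising."""
    model = train_char_bigram(lrl_sentences)
    return sorted(hrl_sentences, key=lambda s: perplexity(s, model))[:k]
```

In practice the language model would be trained on the LRL side of the parallel data, and the selection would keep whole HRL sentence pairs rather than single sentences; the ranking step stays the same.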


In the paper, they proposed two methods for segmenting the test data:
  • using the segmentation rules from the seed model
  • learning new segmentation rules from the LRL data
Thus, for transfer learning they use one of the following strategies:
  • DirAdapt: All the vocabulary and learned parameters of the pre-trained model are used without any change. In this case, the segmentation rules of the pre-trained model are applied to the test language data in both the adaptation and inference stages.
  • DynAdapt: A new vocabulary is generated using new segmentation rules trained on the LRL data. At adaptation time, for vocabulary items that are already present in the existing dictionary, the corresponding pre-trained model weights are transferred, while newly inserted vocabulary items receive a random initialization in the embedding layers and in the pre-softmax linear transformation weight matrix.
Results show that DynAdapt performs better than DirAdapt.
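The weight transfer in DynAdapt can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: it handles a single embedding matrix stored as a list of rows, the uniform initialization range is an assumption on our part, and the name dyn_adapt_embeddings and its parameters are hypothetical.

```python
import random


def dyn_adapt_embeddings(old_vocab, old_emb, new_vocab, seed=0, scale=0.1):
    """Build an embedding matrix for new_vocab.

    old_vocab: dict mapping token -> row index in old_emb (pre-trained model)
    old_emb:   list of rows, one embedding vector per old token
    new_vocab: list of tokens produced by the new segmentation rules
    """
    rng = random.Random(seed)
    dim = len(old_emb[0])
    new_emb = []
    for token in new_vocab:
        if token in old_vocab:
            # Token already known: transfer the pre-trained weights.
            new_emb.append(list(old_emb[old_vocab[token]]))
        else:
            # Newly inserted token: random initialization (assumed uniform range).
            new_emb.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return new_emb
```

The same lookup-or-initialize logic would apply to the pre-softmax weight matrix, while all other pre-trained parameters are carried over unchanged.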

Zero-shot Translation

In the paper, they aim to assess the potential of a large-scale MNMT model for LRLs that have never been seen at training time. This means that the transfer learning that assists LRL translation is expected to come from the HRL data, particularly from related languages that are present in the pre-trained model.

To assess the capability of the MNMT model to handle zero-shot translation, they evaluated i) how pre-trained models perform on unseen test language data before any adaptation stage, and ii) how adapting the models to data selected from related languages affects the quality of zero-shot translation.


For their experiments they used the TED talks corpus, which is available for 58 languages aligned to English. They chose Azerbaijani (az), Belarusian (be), Galician (gl), and Slovak (sk) as the LRLs, with Turkish (tr), Russian (ru), Portuguese (pt), and Czech (cs) as their respective related HRLs. To train the baseline, they used all the languages except the LRL that serves as the test language.

They compared their method with the latest state-of-the-art (SOTA) results, including Data-Augment, and reported that, with DynAdapt and Select-pplx, the proposed approach outperforms the SOTA results on two of the test languages (gl and sk). For az and be, translation quality is marginally lower than with the Data-Augment approach.

In summary

With respect to quality scores, the proposed method is comparable to the Data-Augment approach, which is currently considered the best method for improving LRL translation quality. However, the proposed method is much simpler to implement and experiment with.