Issue #25 - Improving Neural MT with Cross Language Model Pretraining

Dr. Rohit Gupta 14 Feb 2019
This blog post is about language modelling.


One of the reasons for the success of Neural MT, and of deep learning techniques in general, is that they make effective and efficient use of large amounts of training data without adding much overhead in inference time or model size. This also opens the door to using additional related data or models to improve results for a specific task or application.

In natural language processing (NLP), pre-training on related languages or similar tasks is one way to improve model performance. How much it helps depends on the approach and the available data. Recently, Lample and Conneau (2019) developed a cross-lingual language modelling technique and used the pre-trained models to improve various NLP tasks. In this week's article, we discuss how it is applied and how it brought significant improvements to supervised and unsupervised Neural MT.

Causal (traditional) Language Model Training

Causal language model (CLM) training processes text left to right (or right to left for some languages), and the model learns to predict the next token in the sequence. The training therefore uses context from one side only. This is how traditional language models are typically trained.
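As a minimal illustration of the one-sided context, the sketch below lists the (context, target) pairs a left-to-right language model would be trained on for one sentence. The function name and the toy sentence are hypothetical, and a real system would feed these pairs to a neural network rather than print them.

```python
def causal_lm_examples(tokens):
    """Return (left_context, next_token) pairs for a left-to-right language model."""
    # For each position i, the model sees only tokens[:i] and must predict tokens[i].
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


examples = causal_lm_examples(["the", "cat", "sat", "down"])
for context, target in examples:
    print(context, "->", target)
```

Each target is predicted from the tokens to its left only; nothing to its right is visible to the model.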

Masked Language Model Training

Recently, Devlin et al. (2018) proposed the masked language model training approach, which uses context from both sides. In masked language modeling (MLM), the training data is preprocessed so that some percentage of tokens are randomly replaced by a MASK token. The model tries to predict the original word behind each MASK token using the context on both sides of it. For both CLM and MLM, a transformer network was used for training. We covered MLM in more detail in our previous post.
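The masking step can be sketched roughly as follows. This is a simplified illustration, not the exact procedure from the paper: the `[MASK]` string, the `mask_prob` parameter and the function name are assumptions, and every selected token is simply replaced by the mask symbol.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the exact string is an assumption


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK and remember the originals as targets."""
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model must recover this word
        else:
            masked.append(tok)
    return masked, targets


tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The model receives the whole masked sequence, so the words on both sides of each masked position are available when predicting the entries in `targets`.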

Cross-Lingual Language Model

The Cross-Lingual Language Model (XLM) is a technique in which a single language model is trained on, and can score text in, multiple languages. Training combines data from all of those languages. To train XLM, Lample and Conneau (2019) first learn a joint BPE (sub-word) vocabulary using text from all languages in the task. Although training covers multiple languages, the sentences within any one mini-batch (built from streams of 256 tokens) are sampled from the same language. Both the CLM and MLM cross-lingual models are trained in this manner.
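The batching rule, one language per mini-batch, can be sketched as below. The function name and the toy corpora are hypothetical, and the language for each batch is picked uniformly at random here purely for simplicity; that choice is an assumption of this sketch and not necessarily how the authors sample languages.

```python
import random


def single_language_batches(corpora, batch_size, num_batches, seed=0):
    """Build mini-batches in which all sentences come from one language."""
    rng = random.Random(seed)
    languages = sorted(corpora)
    batches = []
    for _ in range(num_batches):
        lang = rng.choice(languages)  # assumption: uniform choice of language
        sentences = [rng.choice(corpora[lang]) for _ in range(batch_size)]
        batches.append((lang, sentences))
    return batches


corpora = {
    "en": ["good morning", "thank you"],
    "de": ["guten morgen", "danke schön"],
    "fr": ["bonjour", "merci beaucoup"],
}
batches = single_language_batches(corpora, batch_size=2, num_batches=4)
for lang, sentences in batches:
    print(lang, sentences)
```

Across batches the model sees every language, but it never mixes languages inside a single batch.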

Supervised Neural MT

Supervised neural MT is an approach where we use parallel (sentence-aligned) data as the main input to train the system. When additional monolingual data is available, which is typically the case, we also make use of it. Back-translation is one way to use additional monolingual data to improve supervised neural MT. In back-translation, we generate parallel data by translating monolingual data with a system trained in the opposite direction. For example, to improve a German-English system, we train an English-German system and translate the additional monolingual English data into German. We then add the generated parallel data to the existing data and train a new German-English system.
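The data flow of back-translation for the German-English example might look like the sketch below. The word-by-word `toy_en_to_de` function is only a stand-in for a trained English-German system, and all names and sentences are hypothetical.

```python
def back_translate(monolingual_english, translate_en_to_de):
    """Pair each real English sentence with a machine-generated German source."""
    return [(translate_en_to_de(en), en) for en in monolingual_english]


def toy_en_to_de(sentence):
    """Stand-in for a trained English-German system (word-by-word lookup)."""
    lexicon = {"hello": "hallo", "world": "welt"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())


# Existing parallel data as (German source, English target) pairs.
real_parallel = [("guten morgen", "good morning")]

# Synthetic pairs: the German side is generated, the English side is real text.
synthetic = back_translate(["hello world"], toy_en_to_de)

# The new German-English system is trained on the combined data.
training_data = real_parallel + synthetic
print(training_data)
```

The target side of every synthetic pair is genuine English text, so the new German-English system still learns to produce real English even though its synthetic source sentences are machine-generated.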

Supervised neural MT can also be improved using XLM: the encoder and decoder of a supervised neural MT system can be pre-trained with XLM. When used in conjunction with back-translation, such a system obtains an improvement of 3.1 BLEU using CLM and 4.4 BLEU using MLM over a baseline system that uses back-translation but not the XLM initialization. The monolingual data is therefore put to better use with XLM pre-training.
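One way to picture the initialization is as copying pre-trained values into every MT parameter that has a counterpart in the language model. The sketch below uses plain dictionaries in place of real network weights; the parameter names, the values and the name-matching rule are all assumptions made for illustration.

```python
import copy


def init_from_pretrained(model_params, pretrained_params):
    """Copy pre-trained values into parameters whose names match the language model."""
    initialised = copy.deepcopy(model_params)
    copied = []
    for name in initialised:
        if name in pretrained_params:
            initialised[name] = copy.deepcopy(pretrained_params[name])
            copied.append(name)
    return initialised, copied


# Hypothetical pre-trained language model weights.
xlm = {"embeddings": [0.1, 0.2], "layer0.self_attention": [0.3, 0.4]}

# Hypothetical, freshly created MT encoder and decoder.
encoder = {"embeddings": [0.0, 0.0], "layer0.self_attention": [0.0, 0.0]}
decoder = {
    "embeddings": [0.0, 0.0],
    "layer0.self_attention": [0.0, 0.0],
    "layer0.cross_attention": [0.0, 0.0],  # no counterpart in the language model
}

encoder, enc_copied = init_from_pretrained(encoder, xlm)
decoder, dec_copied = init_from_pretrained(decoder, xlm)
print(enc_copied, dec_copied)
```

In this sketch the decoder's cross-attention keeps its initial values because a language model has no matching component, while everything else starts from the pre-trained weights before MT training begins.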

Unsupervised Neural MT

In issue #11 and issue #6 of our Neural MT Weekly series, we talked about the ability to build a Neural MT system without large parallel data. The MT models were initialized using a bilingual dictionary or a small amount of parallel data. Further training of the MT models was carried out in an iterative dual-learning fashion using language modeling scores and back-translation scores. Lample and Conneau (2019) observed that unsupervised neural MT can be improved by initializing the encoder and decoder of the unsupervised MT system with XLM pre-training. This yields improvements of 2 to 5 BLEU points using CLM, and 6 to 9 BLEU points using MLM, over the previous state-of-the-art unsupervised MT for French, German and Romanian into and from English.

In summary

It appears safe to say that XLM initialization is an effective tool to push Neural MT quality even further. The improvements obtained with such pre-training are significant. It is highly likely that not only research, but production systems as well, will see updates of a similar nature in the near future.

Dr. Rohit Gupta

Sr. Machine Translation Scientist