Issue #25 - Improving Neural MT with Cross-lingual Language Model Pretraining
Introduction
One of the reasons for the success of Neural MT, and deep learning techniques in general, is the more effective and efficient use of large amounts of training data, without excessive overhead in inference time or model size. This also opens the door to using additional related data or models to improve results for a specific task or application.
In natural language processing (NLP), pre-training on related languages or similar tasks is one way to improve model performance. The size of the benefit depends on the approach and the available data. Recently, Lample and Conneau (2019) developed a cross-lingual language modelling technique for pre-training models that improves a variety of NLP tasks. In this week's article, we discuss how it is applied and how it brought significant improvements to supervised and unsupervised Neural MT.
Causal (traditional) Language Model Training
Causal language model (CLM) training processes text left to right (or right to left for some languages) and predicts the next token in the sequence, so the model only uses context from one side. This is how traditional language models are trained and remains the standard approach.
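As a rough illustration (not the exact setup from the paper), the sketch below shows a single CLM training step in PyTorch. The model interface is a placeholder assumption: it maps token ids to vocabulary logits and applies a causal (left-to-right) attention mask internally, so the loss is simply next-token cross-entropy.

```python
import torch.nn.functional as F

def clm_step(model, optimizer, token_ids):
    """One causal LM training step.
    token_ids: LongTensor of shape (batch, seq_len).
    `model` is assumed to return logits of shape (batch, seq_len - 1, vocab_size)
    and to mask out future positions internally."""
    inputs = token_ids[:, :-1]    # context seen so far
    targets = token_ids[:, 1:]    # the next token at every position
    logits = model(inputs)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```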
Masked Language Model Training
Recently, Devlin et al. (2018) proposed the masked language model training approach, which utilizes context from both sides. In masked language modelling (MLM), the training data is preprocessed so that some percentage of tokens are randomly replaced by a [MASK] token. The model then tries to predict the original word at each [MASK] position using the context on both sides of it. For both CLM and MLM, a Transformer network is used for training. We covered MLM in more detail in our previous post.
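To make the preprocessing concrete, here is a minimal sketch of the masking step, assuming a BERT-style masking rate of roughly 15%. The [MASK] id is a placeholder, and the full BERT recipe (which also mixes in random and unchanged replacements) is omitted for brevity.

```python
import random

MASK_ID = 0        # placeholder id for the [MASK] token
MASK_PROB = 0.15   # fraction of tokens to mask (BERT uses ~15%)

def mask_tokens(token_ids):
    """Return (masked inputs, labels). Labels are -100 at unmasked
    positions so the MLM loss can ignore them."""
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            inputs.append(MASK_ID)   # model must recover the original token
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(-100)      # ignored in the loss
    return inputs, labels
```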
Cross-lingual Language Model
A cross-lingual language model (XLM) is a single language model that can be used to obtain scores for multiple languages. Training is performed on data combined from multiple languages. To train an XLM, Lample and Conneau (2019) first learn a joint BPE (sub-word token) vocabulary using text from all of the languages in the task. Training then covers multiple languages, but within any one mini-batch the streams of sentences (256 tokens each) are all sampled from the same language. Both CLM and MLM cross-lingual models are trained in this manner.
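The batching idea can be sketched as follows. The uniform language choice and the data structures are simplifications (the paper samples languages from a frequency-smoothed distribution), but the key point is that every stream in a mini-batch comes from a single language.

```python
import random

STREAM_LEN = 256   # tokens per training stream, as in the article

def sample_batch(corpora, batch_size):
    """corpora: dict mapping language code -> list of BPE token-id sentences.
    Returns one mini-batch of fixed-length streams, all from one language."""
    # Pick one language for the whole mini-batch
    # (the paper uses a frequency-smoothed distribution, not uniform choice).
    lang = random.choice(list(corpora))
    streams = []
    for _ in range(batch_size):
        stream = []
        while len(stream) < STREAM_LEN:
            stream.extend(random.choice(corpora[lang]))
        streams.append(stream[:STREAM_LEN])
    return lang, streams
```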
Supervised Neural MT
Supervised neural MT is an approach where we use parallel (sentence-aligned) data as the main input to train the system. When additional monolingual data is available, which is typically the case, we utilize it as well. Back-translation is one way to use additional monolingual data to improve supervised neural MT. In back-translation, we generate parallel data by translating monolingual data with a system trained in the opposite direction. For example, if we would like to improve our German-English system, we train an English-German system and translate the additional monolingual English data into German. We then add this generated (synthetic) parallel data to the existing parallel data and train a new German-English system.
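A minimal sketch of that back-translation loop for the German-English example is shown below; train_fn and translate_fn are placeholders for whatever MT toolkit is used, and sentences are represented as plain lists of strings.

```python
def back_translate(train_fn, translate_fn, parallel_de, parallel_en, mono_en):
    """Hypothetical back-translation pipeline for a German-English system.
    train_fn(src_sents, tgt_sents) -> model
    translate_fn(model, sentence)  -> translated sentence
    """
    # 1. Train a reverse (English-German) system on the existing parallel data.
    reverse_model = train_fn(parallel_en, parallel_de)

    # 2. Translate the additional monolingual English data into German.
    synthetic_de = [translate_fn(reverse_model, s) for s in mono_en]

    # 3. Add the synthetic pairs to the real parallel data and train
    #    the final German-English system on the augmented corpus.
    src = parallel_de + synthetic_de
    tgt = parallel_en + mono_en
    return train_fn(src, tgt)
```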
Supervised neural MT can also be improved using XLM: the encoder and decoder of the MT system can be pre-trained with XLM. When used in conjunction with back-translation, such a system obtains a 3.1 BLEU improvement using CLM and a 4.4 BLEU improvement using MLM over a baseline that uses back-translation but no XLM initialization. In other words, XLM pre-training makes better use of the monolingual data.
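As a very rough sketch of what this initialization amounts to, assuming the pretrained XLM and the NMT encoder and decoder share the same architecture and joint-BPE vocabulary (all names below are placeholders, not the paper's actual code):

```python
import torch

def init_from_xlm(nmt_model, xlm_checkpoint_path):
    """Copy pretrained XLM weights into an NMT model's encoder and decoder.
    Assumes matching architectures and a shared joint-BPE vocabulary."""
    xlm_state = torch.load(xlm_checkpoint_path, map_location="cpu")
    # Initialise both sides with the same pretrained parameters; layers that
    # exist only in the decoder (e.g. cross-attention) have no counterpart in
    # the checkpoint and keep their random initialisation (strict=False).
    nmt_model.encoder.load_state_dict(xlm_state, strict=False)
    nmt_model.decoder.load_state_dict(xlm_state, strict=False)
    return nmt_model
```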