Issue #11 - Unsupervised Neural MT

Dr. Rohit Gupta 27 Sep 2018
The topic of this blog post is low-resource languages.

Introduction

In this week’s article, we will explore unsupervised machine translation. In other words, training a machine translation engine without using any parallel data! As you might imagine, the potential implications of not needing any parallel data to train a Neural MT engine could be huge.

In general, most of the approaches in this direction still use some bilingual signal, for example using parallel data in related languages, pivoting, using a small parallel corpus, or using a bilingual dictionary. When there is no directly parallel data to use, results are typically much worse compared to supervised methods. However, here we take a look at the technique proposed by Lample et al. (2018), which recently won the Best Paper Award at the prestigious EMNLP 2018 conference. This approach uses only monolingual data in both languages and still obtains a decent MT system, performing better than a neural MT system trained on 100,000 parallel sentences.

Cross-lingual word embeddings

When no parallel data is available, the first step is to obtain cross-lingual word embeddings. These are usually obtained by training monolingual embeddings in each language separately, and then learning a mapping matrix which maps the embeddings of one language into the space of the other.
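To make the setup concrete, here is a minimal sketch (the embedding file names and dimensionality are hypothetical) of two independently trained monolingual embedding tables and a mapping matrix W that projects source vectors into the target space:

    import numpy as np

    # Hypothetical pre-trained monolingual embeddings (vocab_size x dim),
    # e.g. trained separately on each language's monolingual corpus.
    src_emb = np.load("en.vectors.npy")   # assumed file name
    tgt_emb = np.load("de.vectors.npy")   # assumed file name

    dim = src_emb.shape[1]
    W = np.eye(dim)  # the mapping matrix to be learned (identity as a placeholder)

    def to_target_space(src_vectors, W):
        """Project source-language embeddings into the target embedding space."""
        return src_vectors @ W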

Once again, when training cross-lingual mappings, most approaches use a small parallel corpus or a small bilingual dictionary to seed the process. For example, Mikolov et al. (2013) obtained a linear mapping between source and target embeddings using a bilingual dictionary of 5,000 words. However, in recent work, Conneau et al. (2018) leverage adversarial training to learn a linear mapping from the source to the target space without using any bilingual dictionary.
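As a rough illustration of the seeded variant (a sketch, not Mikolov et al.'s exact training procedure), the mapping can be estimated from a small dictionary of word pairs by least squares; with an orthogonality constraint on W (Xing et al. 2015), the solution is the closed-form Procrustes one:

    import numpy as np

    def learn_linear_mapping(src_emb, tgt_emb, seed_pairs):
        """Least-squares W such that src_emb[i] @ W ~ tgt_emb[j] for each seed pair (i, j)."""
        X = np.stack([src_emb[i] for i, _ in seed_pairs])   # source vectors of dictionary entries
        Y = np.stack([tgt_emb[j] for _, j in seed_pairs])   # their target translations
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return W

    def learn_orthogonal_mapping(src_emb, tgt_emb, seed_pairs):
        """Procrustes solution: the best orthogonal W is U @ Vt, where U, S, Vt = SVD(X^T Y)."""
        X = np.stack([src_emb[i] for i, _ in seed_pairs])
        Y = np.stack([tgt_emb[j] for _, j in seed_pairs])
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt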

Adversarial Training

In general terms, adversarial training is a two-player game involving a generator and a discriminator (Goodfellow et al. 2014). For example, in image processing, a generator is trained to fool the discriminator by generating images that look like real images, and the discriminator is trained to distinguish between fake (generated) and real images. In this way, the system learns to generate realistic-looking images, whether of human faces, cats, cars or various other objects, or even artwork in the style of Picasso (Tan et al. 2017).

For our purpose, the mapping matrix can be seen as a generator. The mapper is trained to fool the discriminator by mapping source word embeddings close to the target embeddings. The discriminator is trained to distinguish between mapped source embeddings (fake targets) and real target embeddings. Training proceeds by randomly sampling mapped source and real target vectors and computing the losses of the mapper and the discriminator accordingly.
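A minimal PyTorch-style sketch of this game is given below; the layer sizes, optimisers and learning rates are illustrative assumptions rather than the exact configuration used by Conneau et al. (2018):

    import torch
    import torch.nn as nn

    dim = 300  # embedding dimensionality (assumed)

    # Generator: the mapping matrix W, implemented as a bias-free linear layer.
    mapping = nn.Linear(dim, dim, bias=False)

    # Discriminator: a small classifier predicting "real target" (1) vs "mapped source" (0).
    discriminator = nn.Sequential(
        nn.Linear(dim, 2048), nn.LeakyReLU(0.2),
        nn.Linear(2048, 1), nn.Sigmoid(),
    )

    bce = nn.BCELoss()
    opt_map = torch.optim.SGD(mapping.parameters(), lr=0.1)
    opt_dis = torch.optim.SGD(discriminator.parameters(), lr=0.1)

    def adversarial_step(src_batch, tgt_batch):
        """One training step; src_batch and tgt_batch are (B, dim) embedding tensors."""
        # Discriminator: learn to tell mapped source (label 0) from real target (label 1).
        with torch.no_grad():
            fake = mapping(src_batch)
        pred = discriminator(torch.cat([fake, tgt_batch])).squeeze(1)
        gold = torch.cat([torch.zeros(len(fake)), torch.ones(len(tgt_batch))])
        opt_dis.zero_grad()
        loss_dis = bce(pred, gold)
        loss_dis.backward()
        opt_dis.step()

        # Mapping (generator): update W so that mapped source vectors fool the discriminator.
        opt_map.zero_grad()
        loss_map = bce(discriminator(mapping(src_batch)).squeeze(1),
                       torch.ones(len(src_batch)))
        loss_map.backward()
        opt_map.step()
        return loss_dis.item(), loss_map.item()

The published system adds further refinements on top of this basic loop, but the core generator/discriminator interplay is as above.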

Furthermore, because we would like to use these embeddings for machine translation, we operate on shared sub-word units (Byte Pair Encoding; Sennrich et al. 2016) rather than words, and train the embeddings and the mapping matrix on them. Using this technique, a good-quality mapping is obtained (63% accuracy on English-German). Using this mapping, we create a dictionary of frequent words, and then use that dictionary to obtain a better mapping by training to minimize the difference between mapped source and target embeddings, just as when a bilingual dictionary is available (Mikolov et al. 2013; Xing et al. 2015). Conneau et al. (2018) also use a cross-domain similarity scaling procedure (CSLS) to further improve the source-to-target mapping. The resulting embeddings reach 74% accuracy on English-German.
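Below is a simplified sketch of CSLS-based dictionary induction (assuming L2-normalised embedding matrices; this is an illustration, not the authors' implementation). The induced pairs of frequent words can then be fed back into a Procrustes-style refinement like the one sketched earlier:

    import numpy as np

    def csls_scores(mapped_src, tgt_emb, k=10):
        """
        Cross-domain similarity local scaling: cosine similarity penalised by each
        word's average similarity to its k nearest neighbours in the other language
        (this reduces the "hubness" problem of plain nearest-neighbour retrieval).
        """
        sim = mapped_src @ tgt_emb.T                        # cosine similarities (normalised vectors)
        r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # avg sim of each source word to its k NNs
        r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # avg sim of each target word to its k NNs
        return 2 * sim - r_src[:, None] - r_tgt[None, :]

    def induce_dictionary(mapped_src, tgt_emb, k=10):
        """Pair each (frequent) source word with its best target word under CSLS."""
        scores = csls_scores(mapped_src, tgt_emb, k)
        return [(i, int(j)) for i, j in enumerate(scores.argmax(axis=1))]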

Back Translation

Once the bilingual dictionary is in place, we can use it to build a "toy" machine translation system. Using this MT system, we can translate the monolingual data to generate parallel data - as we discussed previously in Issue #5 on data creation for Neural MT. Once we have parallel data, we can use it to build a machine translation system. In addition, we can use the same parallel data to rebuild the toy MT system that generated it. This process can be repeated many times to improve both the generated parallel data and the resulting MT system. Since we have monolingual data in both languages, language model scores are also taken into consideration during training. Combining back-translation and language models has also been explored in previous work, as we discussed earlier in our Zero-Shot MT post, where the initial system was obtained by pivoting or by using some parallel data.
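At a high level, the loop looks roughly like this; translate and train_nmt are hypothetical placeholders standing in for full NMT decoding and training, and the language-model terms are omitted for brevity:

    # A sketch of iterative back-translation with monolingual data on both sides.
    # translate(), train_nmt() and the corpus/model names are illustrative placeholders.

    def unsupervised_training(mono_src, mono_tgt, toy_src2tgt, toy_tgt2src, iterations=3):
        src2tgt, tgt2src = toy_src2tgt, toy_tgt2src   # start from the word-by-word "toy" systems
        for _ in range(iterations):
            # 1. Back-translate the monolingual data with the current models.
            synthetic_tgt = [translate(src2tgt, s) for s in mono_src]   # pairs (real src, synthetic tgt)
            synthetic_src = [translate(tgt2src, t) for t in mono_tgt]   # pairs (synthetic src, real tgt)

            # 2. Retrain both directions so that each model learns to produce the
            #    real monolingual text from its synthetic counterpart.
            src2tgt = train_nmt(source=synthetic_src, target=mono_tgt)
            tgt2src = train_nmt(source=synthetic_tgt, target=mono_src)
        return src2tgt, tgt2src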

Will the resulting MT engine be any good?

The unsupervised engine performed much better than previous efforts tackling this problem, with an improvement of around 10 BLEU points on average across various language pairs. It also performed significantly better than a Neural MT model trained on only 100,000 parallel sentences. This makes it a potentially useful technique for training MT systems for resource-poor languages. What remains to be done is an evaluation of how effective the translations are in absolute terms, or relative to a particular use case in practice.

In summary

In this post, we explored training a machine translation system using only monolingual data from both languages. We can conclude that machine translation is quickly moving in a direction where the availability of parallel data is no longer the deal breaker in determining whether we can build a decent MT system for a particular language pair. That's big!
Author

Dr. Rohit Gupta
Sr. Machine Translation Scientist