Issue #6 - Zero-Shot Neural MT
As we covered in last week’s post, training a neural MT engine requires a lot of data, typically millions of sentences in both languages that are aligned at the sentence level, i.e. every sentence in the source (e.g. Spanish) has a corresponding target (e.g. English). During a typical training run, the system looks at these bilingual sentence pairs and learns from them. The learning procedure makes use of the fact that we can generate a new translation, compare it with the available target to see how the model is performing, and update the parameters accordingly.
In the absence of such data, this training procedure cannot be followed, as we don’t have any reference to compare with. What now?
Pivoting
To deal with a scenario where bilingual data is not available, we often use pivoting. We translate first into an intermediate language, and from there into the target language. For example, if we need to translate from Spanish to French and we do not have Spanish-French bilingual data, we can translate from Spanish to English first, and then from English to French. The obvious issue with this approach is the quality of the output. The initial output from Spanish to English will not be perfect, and this has a knock-on effect on the quality of the French translation, as the errors percolate into the second step.
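As a minimal sketch of pivoting, the snippet below chains two translation steps through English. The word-level dictionaries and the `translate` and `pivot_translate` helpers are hypothetical toy stand-ins for real MT engines; they only illustrate how an error made in the first step carries through to the final output.

```python
# Toy word-for-word "engines" (assumed for illustration); real systems
# would be full MT models.
ES_EN = {"el": "the", "gato": "cat", "duerme": "sleeps"}
EN_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}


def translate(sentence, table):
    """Translate word by word; unknown words are passed through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())


def pivot_translate(sentence, src_to_pivot, pivot_to_tgt):
    """Translate source -> pivot -> target by chaining two engines."""
    pivot = translate(sentence, src_to_pivot)
    return translate(pivot, pivot_to_tgt)


good = pivot_translate("el gato duerme", ES_EN, EN_FR)

# Simulate an error in the first step: "gato" is mistranslated as "dog".
NOISY_ES_EN = dict(ES_EN, gato="dog")
EN_FR_WITH_DOG = dict(EN_FR, dog="chien")
bad = pivot_translate("el gato duerme", NOISY_ES_EN, EN_FR_WITH_DOG)

print(good)  # le chat dort
print(bad)   # le chien dort
```

The second engine translates its input faithfully in both cases; it simply has no way of knowing that the English it received was already wrong.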
Johnson et al. (2017) proposed a multilingual NMT system that removes the need for such pivoting by training a single MT system on multiple languages at the same time. The system is trained like any other MT system, with one small addition to the training data: a special token indicating the target language is inserted at the start of each source sentence to tell the system which language it is supposed to translate into. The token is added both at training time and at translation time.
To get a system that can translate from Spanish to French, we can tag the source (i.e. Spanish) of Spanish-English data with English as the target language and source (i.e. English) of English-French data with French as the target language. A single NMT system is then trained with all of the data. The resulting system will be able to translate between three language pairs: Spanish to English, Spanish to French, and English to French. In the same way, we can add more languages at training time and have a many-to-many language translation system.
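The data preparation described above can be sketched in a few lines. The `<2xx>` token format, the helper names, and the example sentences are assumptions made for illustration; this is not the exact preprocessing used by Johnson et al. (2017).

```python
def tag_source(source, target_lang):
    """Prepend a target-language token to a source sentence."""
    return f"<2{target_lang}> {source}"


def build_training_data(corpora):
    """Merge several bilingual corpora into one tagged training set.

    `corpora` maps a (source_lang, target_lang) pair to a list of
    (source_sentence, target_sentence) tuples.
    """
    data = []
    for (_, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            data.append((tag_source(src, tgt_lang), tgt))
    return data


corpora = {
    ("es", "en"): [("El gato duerme.", "The cat sleeps.")],
    ("en", "fr"): [("The cat sleeps.", "Le chat dort.")],
}
training_data = build_training_data(corpora)

# At translation time, the same tagging requests a direction that
# never appeared in the training data (Spanish to French).
zero_shot_input = tag_source("El gato duerme.", "fr")
print(zero_shot_input)  # <2fr> El gato duerme.
```

Note that only the target language is marked. The source language is left for the model to infer, which is what allows a Spanish sentence to be paired with the French token even though no such pair was seen in training.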
The system learns to translate from Spanish to French without seeing a single example of that language pair directly. This is known as zero-shot learning.
Zero-Shot Dual Learning
While Johnson et al. (2017) managed to train a multilingual NMT system, the quality was very poor for a zero-data scenario (e.g. Spanish to French in the above case) and worse than pivoting (roughly a 20 to 50% drop in BLEU score). To improve this, Sestorain et al. (2018) proposed using “dual” NMT (He et al., 2016). In dual NMT, two machine translation systems are trained jointly, one from the source language to the target language and another from the target language to the source language. Instead of requiring large amounts of bilingual data, dual NMT exploits monolingual data more efficiently. The training is performed as follows:
1. Train base MT systems: source-target and target-source (e.g. Spanish-French and French-Spanish systems, using the multilingual NMT approach or a small Spanish-French parallel data set).
2. Train two separate language models using monolingual data, one for the source language and another for the target language. (A language model measures the fluency of a text.)
3. Sample a source sentence from the source monolingual data and translate it using the source-target engine. Score the translation with the target language model from step 2.
4. Translate the target sentence obtained in the previous step back into the source language using the target-source engine. Compute the similarity score (back-translation score) between this back-translation and the original source sentence.
5. Use both the back-translation score and the language model score to update the parameters of both machine translation systems.
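The round trip in the steps above can be sketched as follows. Everything here is a toy stand-in assumed for illustration: the word-level engines, the vocabulary-based fluency score, the word-overlap similarity, and the weighted sum with the hypothetical weight `alpha`. Real dual NMT uses neural engines and language models, and the parameter update itself is left out.

```python
def lm_score(sentence, vocabulary):
    """Toy fluency score: fraction of words the 'language model' knows."""
    words = sentence.split()
    return sum(w in vocabulary for w in words) / len(words) if words else 0.0


def reconstruction_score(original, back_translated):
    """Toy similarity: word overlap (Jaccard) between source and round trip."""
    a, b = set(original.split()), set(back_translated.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def dual_reward(source, forward, backward, target_vocab, alpha=0.5):
    """Run one source -> target -> source round trip and score it."""
    target = forward(source)                  # translate with source-target engine
    fluency = lm_score(target, target_vocab)  # language model score of the output
    round_trip = backward(target)             # translate back with target-source engine
    fidelity = reconstruction_score(source, round_trip)
    # A real system would use this reward to update both engines (omitted).
    return alpha * fluency + (1 - alpha) * fidelity


ES_FR = {"el": "le", "gato": "chat", "duerme": "dort"}
FR_ES = {v: k for k, v in ES_FR.items()}


def word_engine(table):
    """Build a word-for-word 'engine' from a lookup table."""
    return lambda s: " ".join(table.get(w, w) for w in s.split())


reward = dual_reward(
    "el gato duerme",
    forward=word_engine(ES_FR),
    backward=word_engine(FR_ES),
    target_vocab={"le", "chat", "dort"},
)
print(reward)  # 1.0
```

The point of combining the two scores is that neither is sufficient alone: a forward engine that simply copies its input would reconstruct the source perfectly, but its output would score poorly under the target language model.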
How does it perform?
The dual NMT system drastically improved upon the multilingual system, raising BLEU from 10.02 to 35.54 for Spanish to French and from 6.35 to 38.83 for French to Spanish. It also performed similarly (only 2 BLEU points lower) to a system trained by extending multilingual NMT with parallel data.
He et al. (2016) carried out additional tests to mimic a real-world scenario whereby there may only be a small data set to train a “standard” NMT engine.
They observed that dual NMT not only gives much better quality than a standard NMT system trained on a small data set, but can also be competitive with an engine trained on a larger data set.
|System (BLEU)|English to French|French to English|
|---|---|---|
|NMT (all data)|29.78|27.50|
|NMT (small data)|25.32|22.27|
Technically, once a small parallel data set is used, the system can no longer be classified as zero-shot MT. Nevertheless, this approach has the potential to be a game changer for languages where a small parallel data set is available or can be generated. However, in the He et al. (2016) experiments, the “small” data set still had 1.2 million sentence pairs, which is relatively large. It would be good to establish how small the “small data” can be in the dual NMT architecture while still producing a decent NMT system.
None of the research we’ve referenced here included a comparison between dual NMT and pivoting. Sestorain et al. (2018) mention “Work in progress”, so we may see such an evaluation in future versions.