In Issue #11 of this series, we first looked directly at the topic of unsupervised machine translation: training an engine without any parallel data. Since then, it has gone from a promising concept to one that can produce effective systems performing close to the level of fully supervised engines (trained with parallel data). The prospect of building good MT engines with only monolingual data has tremendous potential impact, since it opens up the possibility of applying MT in many new scenarios, particularly for low-resource languages.
The growing interest in this topic has been reflected in this blog: this issue is the fourth article that deals with the idea, and the earliest of them dates back to before Issue #11. This week, we'll give a quick recap of what has been done before taking a look at the paper by Artetxe et al. (2019), which proposes an effective approach to building an unsupervised neural MT engine initialised by an improved phrase-based unsupervised MT engine.
When translating in a direction in which there is no parallel training data, the usual approach was pivoting. For example, to translate from Spanish to French, pivoting used a sequence of Spanish-English and English-French engines.
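As a minimal sketch of the idea, pivoting is simply the composition of two engines. The word-by-word "engines" and their tiny dictionaries below are toy stand-ins invented for illustration, not real MT systems.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def make_word_translator(table: Dict[str, str]) -> Translator:
    """Toy 'engine': translate word by word, copying unknown words unchanged."""
    def translate(sentence: str) -> str:
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate


def pivot(first: Translator, second: Translator) -> Translator:
    """Chain two engines through a pivot language (here, English)."""
    return lambda sentence: second(first(sentence))


# Hypothetical miniature dictionaries, for illustration only.
es_en = make_word_translator({"gato": "cat", "negro": "black"})
en_fr = make_word_translator({"cat": "chat", "black": "noir"})

es_fr = pivot(es_en, en_fr)
print(es_fr("gato negro"))
```

Any error made by the first engine is passed on to the second, which is one reason why approaches that avoid the pivot step are attractive.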
In Issue #6, we talked about zero-shot NMT. Johnson et al. (2017) removed the need for pivoting with a multilingual NMT system trained on multiple languages at the same time. For example, a system trained with only Spanish-English and English-French data was able to translate from Spanish to French. However, the resulting Spanish-French engine was relatively poor. This system was improved by Sestorain et al. (2018) with dual NMT learning (He et al., 2016). In dual NMT, two machine translation systems are trained jointly, one from the source language to the target language and another from the target language to the source language. These two MT systems are improved in a loop in which the optimisation objective relies on monolingual data. A source sentence is translated into the target language, and the translation is translated back into the source language. The optimisation score is based on the fluency of the translation (in the target language) and the similarity between the initial source sentence and its round-trip translation.
In Issue #11, we took a look at the system proposed by Lample et al. (2018). This was one of the first unsupervised neural MT systems proposed (along with the one by Artetxe et al., 2018). These systems share the following features:
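The dual learning score described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two signals are combined by linear interpolation with a weight `alpha`, and the word-by-word translators, the vocabulary-based fluency score and the token-overlap similarity are toy stand-ins for real NMT models, a language model and a reconstruction probability.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def word_by_word(table: Dict[str, str]) -> Translator:
    """Toy translator: map each word through a dictionary, copy unknown words."""
    return lambda s: " ".join(table.get(w, w) for w in s.split())


# Hypothetical miniature dictionaries, for illustration only.
ES_FR = {"el": "le", "gato": "chat", "duerme": "dort"}
FR_ES = {fr: es for es, fr in ES_FR.items()}


def fluency(target: str) -> float:
    """Stand-in for a language model score: share of known target words."""
    tokens = target.split()
    return sum(t in FR_ES for t in tokens) / len(tokens)


def similarity(source: str, round_trip: str) -> float:
    """Stand-in for the reconstruction score: Jaccard overlap of tokens."""
    a, b = set(source.split()), set(round_trip.split())
    return len(a & b) / len(a | b)


def dual_reward(source: str, fwd: Translator, bwd: Translator,
                alpha: float = 0.5) -> float:
    """Score one monolingual sentence: fluency of the translation plus
    similarity of the round-trip translation to the original sentence."""
    target = fwd(source)          # source -> target
    round_trip = bwd(target)      # target -> back to source
    return alpha * fluency(target) + (1 - alpha) * similarity(source, round_trip)


fwd, bwd = word_by_word(ES_FR), word_by_word(FR_ES)
print(dual_reward("el gato duerme", fwd, bwd))
print(dual_reward("el perro duerme", fwd, bwd))
```

In the second call, the unknown word "perro" is copied through, so the round trip is perfect but the fluency term drops. This shows why both signals are needed: the round-trip score alone would reward a system that copies its input.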
- They are initialised with pre-trained cross-lingual word embeddings. These cross-lingual embeddings may be trained by adversarial training or a bilingual dictionary (built manually or induced automatically). They are used to build initial MT systems in source-target and target-source directions.
- The initial MT systems are improved in a loop similar to the dual NMT one, with a training objective based on a language model (measuring fluency) score and a cycle translation score (from source to target and back to source). At the beginning of the training loop, phrase-based SMT systems are more performant than NMT systems.
In Issue #25, we saw how this approach was improved with cross-lingual language model pre-training. In that case, not only the cross-lingual word embeddings but also the entire initial encoder and decoder are pre-trained (Lample et al., 2019).
The system proposed by Artetxe et al. (2019) still initialises the system with cross-lingual word embeddings, but enhances the dual MT loop (the second point above) by improving the phrase-based SMT systems used and progressively substituting them with NMT engines.
Improved Unsupervised SMT
Artetxe et al. improve the SMT systems used in the cycle translation loop in three respects. Firstly, they enhance the translation model with direct and inverse character-based similarity features, in addition to the phrase translation and lexical weighting probabilities. This is because the initial system, based on cross-lingual word embeddings trained on the contexts in which the words appear, may mix up proper nouns that appear in similar contexts (for example: “Sunday Telegraph” and “The Times of London”). Secondly, the models are tuned with a modified procedure based on a monolingual objective (similar to the dual MT one). Thirdly, the training of phrase tables, performed with back-translated data (authentic target and synthetic source data, or vice versa), is improved. Source-target and target-source back-translated data are combined so that probabilities are estimated only on authentic data.
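To see why a character-level feature helps with proper nouns, here is a minimal sketch using a length-normalised Levenshtein similarity. This particular formula is our assumption for illustration and is not necessarily the exact feature used in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def char_similarity(src_phrase: str, tgt_phrase: str) -> float:
    """Assumed feature: 1 minus the edit distance normalised by the longer phrase."""
    longest = max(len(src_phrase), len(tgt_phrase))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(src_phrase, tgt_phrase) / longest


# A name paired with itself scores 1.0, while a name paired with a different
# name that merely occurs in similar contexts scores much lower.
print(char_similarity("Sunday Telegraph", "Sunday Telegraph"))
print(char_similarity("Sunday Telegraph", "The Times of London"))
```

Added as an extra score in the phrase table, a feature of this kind lets the decoder prefer candidates that look alike on the surface when the embeddings alone cannot tell two names apart.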
In the dual MT training loop of this work, the models are updated via a synthetic parallel corpus obtained by back-translation. In the first iteration, the synthetic parallel corpus is obtained entirely with the SMT system. At each subsequent iteration, more sentences are back-translated with the NMT systems, until the synthetic corpus is entirely generated by the NMT engines. This progressive hybridisation yields large BLEU score improvements.
Artetxe et al. achieve results comparable to those of Lample et al. (2019). These results are also comparable to the state of the art of supervised MT in 2014.
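The hand-over from SMT to NMT can be sketched as a simple schedule that decides, at each iteration, how many sentences each kind of system back-translates. The linear decay and the parameter names (`iteration`, `handover_iterations`) are assumptions made for illustration; the text above only tells us that the first iteration is all SMT and the final ones are all NMT.

```python
from typing import Tuple


def backtranslation_split(n_sentences: int, iteration: int,
                          handover_iterations: int) -> Tuple[int, int]:
    """Return (n_smt, n_nmt): how many sentences to back-translate with the
    SMT and NMT systems at a given iteration (counted from 0).

    Assumed schedule: the SMT share decays linearly from 100% at iteration 0
    to 0% at `handover_iterations`, after which NMT generates everything.
    """
    remaining = max(0, handover_iterations - iteration)
    n_smt = n_sentences * remaining // handover_iterations
    return n_smt, n_sentences - n_smt


for t in (0, 5, 10, 20):
    print(t, backtranslation_split(1000, t, handover_iterations=10))
```

The intuition is that the SMT systems are the stronger ones early on, so they supply most of the synthetic data at first and the NMT systems take over as they improve.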
Unsupervised neural MT is already becoming a reality, and it is now doing as well as supervised MT was doing five years ago. This opens an avenue for applying MT in scenarios in which no parallel data are available, which, if you ask practitioners and enterprise users who need to support many languages, is a big deal!