As has been covered a number of times in this series, Neural MT requires good data for training, and acquiring such data for new languages can be costly and not always feasible. One approach in the Neural MT literature for improving translation quality for low-resource languages is transfer learning. A common practice is to reuse the model parameters (encoder, decoder, and word embeddings) of a high-resource language and fine-tune them for a specific domain or language. In this post, we take a look at dynamic vocabulary, a new concept that improves transfer learning in Neural MT.
Transfer Learning in NMT
Transfer learning uses knowledge from a learned task to improve performance on a related task, typically reducing both the amount of required training data and the training time. In Neural MT, research has shown promising results when transfer learning is applied to leverage existing models to cope with the scarcity of training data in specific domains or language settings. More broadly, pre-trained models have been successfully exploited and shown to improve translation quality to a great extent.
Zoph et al. (2016) used a parent-child setting in which they trained a “parent” model on a large amount of available data. The encoder-decoder components were then transferred to initialise the parameters of a low-resource “child” model. In their experiments, they kept the decoder parameters of the child model fixed during fine-tuning. They reported an average BLEU improvement of 7.5 for Hausa, Turkish, Uzbek, and Urdu (into English) using French as the parent language. Nguyen and Chiang (2017) further extended the parent-child approach and analysed the effect of using related languages on the source side. They exploited the overlap of vocabulary between parent and child languages and reported further improvements on top of Zoph et al. (2016).
Multilingual architectures are an evolving form of transfer learning in Neural MT. Dong et al. (2015) proposed a multilingual (one-to-many) architecture that uses a single encoder for the source language and separate attention mechanisms and decoders for each target language. Ha et al. (2016) and Johnson et al. (2016) proposed simple yet efficient many-to-many architectures for multilingual NMT. Ha et al. (2016) applied a language-specific code to words from different languages in a mixed-language vocabulary. Johnson et al. (2016) added a language flag (prepended to the start of the segment) to the input string, eliminating the need for separate encoder/decoder networks and an attention mechanism for each new language pair. They trained NMT models of up to twelve language pairs with better translation quality than the individual pairs.
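To make the language-flag idea concrete, here is a minimal sketch in the spirit of Johnson et al. (2016): a single shared model is steered towards a target language by a token prepended to the source segment. The `<2xx>` flag format and function name are illustrative assumptions, not the authors' exact implementation.

```python
def add_target_flag(source_segment: str, target_lang: str) -> str:
    """Prepend a target-language flag so one shared encoder/decoder
    can serve every language pair (illustrative flag format)."""
    return f"<2{target_lang}> {source_segment}"

# The same English input can be routed to Spanish or German
# simply by changing the flag, with no change to the model.
print(add_target_flag("How are you?", "es"))  # <2es> How are you?
print(add_target_flag("How are you?", "de"))  # <2de> How are you?
```

Because the flag is just another token in the vocabulary, adding a new target language requires no architectural change, which is what makes this approach attractive for transfer learning.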
Using a dynamic vocabulary is one of the latest additions to transfer-learning approaches in NMT. In this approach, instead of directly using the parent model's vocabulary, the vocabulary is updated according to the required domain and language. Lakew et al. (2018) extended Johnson et al. (2016) and explored the effect of using a dynamic vocabulary in the following scenarios:
- New data, in terms of domains or languages, emerges over time (most real-world MT training scenarios fall into this category)
- All the training data for all the language pairs is available from the beginning
In the first case, they used the intersection of Vp and Vc, replacing entries of Vp that do not exist in Vc with new entries from Vc (here Vp and Vc represent the vocabularies of the parent and current models, respectively). In the second scenario, they kept Vp intact and added the new entries of the new domain from Vc. During training, these new entries are randomly initialised, while the intersecting items retain the embeddings of the parent model.
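The first scenario can be sketched as follows: shared entries (Vp ∩ Vc) keep their parent embeddings, while entries that appear only in the child vocabulary are randomly initialised. The function name, embedding dimension, and initialisation scheme here are illustrative assumptions, not details from Lakew et al. (2018).

```python
import numpy as np

def build_dynamic_vocab(parent_vocab, parent_emb, child_vocab, dim=4, seed=0):
    """Merge a child vocabulary into a parent model's embedding table:
    tokens in Vp ∩ Vc reuse the parent embedding; child-only tokens
    get a fresh random embedding (all names/shapes illustrative)."""
    rng = np.random.default_rng(seed)
    new_vocab, new_emb = [], []
    for token in child_vocab:
        if token in parent_vocab:
            # Vp ∩ Vc: transfer the learned parent embedding
            new_emb.append(parent_emb[parent_vocab.index(token)])
        else:
            # Vc-only entry: randomly initialised, trained from scratch
            new_emb.append(rng.normal(size=dim))
        new_vocab.append(token)
    return new_vocab, np.stack(new_emb)

# Toy usage: "dog" is new, so it gets a random vector;
# "the" and "sat" keep their (here all-zero) parent embeddings.
Vp = ["the", "cat", "sat"]
Ep = np.zeros((3, 4))  # stand-in for trained parent embeddings
Vc = ["the", "dog", "sat"]
vocab, emb = build_dynamic_vocab(Vp, Ep, Vc)
print(vocab)  # ['the', 'dog', 'sat']
```

The key design point is that the child model starts from as much of the parent's learned representation as the vocabularies allow, which is what gives the transfer its benefit over training from scratch.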
In their experiments, they used a many-to-many multilingual NMT architecture with a dynamic vocabulary and compared its quality against two baseline models: 1) a multilingual NMT system trained with in-domain data only, and 2) a multilingual parent model fine-tuned with in-domain data using the parent vocabulary (Vp). They reported significant improvements using the dynamic vocabulary, ranging from 3.0 to 13.0 BLEU points over their baselines.
Dynamic vocabulary is a rather new but promising add-on to current transfer-learning techniques in Neural MT. The significant improvements over the baseline models in Lakew et al. (2018) suggest that dynamic vocabulary is a viable setting in multilingual NMT architectures and gives developers another option when it comes to limited training data scenarios.