Issue #15 - Document-Level Neural MT
Introduction
In this week’s post, we take a look at document-level neural machine translation. Most, if not all, existing approaches to machine translation operate at the sentence level. That is to say, when translating a document, it is actually split into individual sentences or segments, which are processed independently of each other. With document-level Neural MT, as the name suggests, we go beyond sentence-level translation and take some surrounding sentences and context into account when translating any particular sentence.
Why do we need document level MT?
The benefits of document-level translation are clear. If we can take the whole document into account, like human translators do, we get better coherence, cohesion, consistency and terminology selection. Currently, even when we send a document or paragraph to an MT system, it is internally split into individual sentences, and these sentences are translated independently. When we translate sentences without looking at the surrounding text, we lose contextual information and the ability to resolve ambiguous cases. For example, the same pronoun in the source language can be translated differently in the target language depending on which noun it refers to, but that information may only be available somewhere else in the document. The Hindi sentence “वह एक व्यापारी है (Wah ek vyapari hai)” can be translated as “She is a businesswoman” or “He is a businessman” depending on the gender: “वह (wah)” is a pronoun that can refer to either “he” or “she”.
Some promising approaches
Recently, Neural MT research has moved in this direction and we have some interesting work to share. Let's take a look at three recently proposed approaches:
- Using an Additional Encoder (Hierarchical RNN): The idea of using an additional neural network to model the previous part of a task was introduced by Sordoni et al. (2015), who modeled previous queries in a separate session-RNN to provide better suggestions for the query a user is currently searching. Similarly, we can model the previous source text as document context to better translate the current sentence. Wang et al. (2017) proposed such an approach, modelling the document context with an additional encoder (based on a recurrent neural network) over the previous three sentences (see the first sketch after this list).
- Cache-based Memory Network: In a cache-based memory network (Tu et al. 2017), a cache holding relevant information from previous translations is used to capture document information, and the cache is updated whenever a new sentence is translated. A new word is added by replacing the least recently used one, and if a word is already in the cache, its entry is updated with the average of the current and previously stored values (see the second sketch after this list). The cache size can vary from 25 to 500 words. The approach can be combined with various neural networks and improves both RNN and Transformer models.
- Hierarchical Attention Network (HAN): Like the additional-encoder approach, the Hierarchical Attention Network (Miculicich et al. 2018) also uses an additional network to model information from the previous input text. However, by using attention, the network can focus on different words and sentences as required (see the third sketch after this list). It also applies document context on both the encoder and the decoder, taking into account both the source text and the previous translations. This approach achieved its best results when considering the previous three sentences as document context.
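To make the additional-encoder idea concrete, here is a minimal PyTorch sketch of a two-level context encoder in the spirit of Wang et al. (2017): a sentence-level RNN summarises each previous sentence, and a document-level RNN reads those summaries in order. All class and variable names below are our own illustrative choices, not taken from the paper.

```python
# Sketch only: a hierarchical RNN that turns the previous sentences
# into a single document-context vector for the translation model.
import torch
import torch.nn as nn

class DocumentContextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Sentence-level RNN: encodes each previous sentence into one vector.
        self.sent_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Document-level RNN: reads the sentence vectors in order.
        self.doc_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, prev_sentences):
        # prev_sentences: list of 1-D LongTensors, one per preceding
        # sentence (Wang et al. use the previous three sentences).
        sent_vecs = []
        for sent in prev_sentences:
            emb = self.embed(sent.unsqueeze(0))   # (1, len, emb_dim)
            _, h = self.sent_rnn(emb)             # h: (1, 1, hidden_dim)
            sent_vecs.append(h.squeeze(0))        # (1, hidden_dim)
        sent_seq = torch.stack(sent_vecs, dim=1)  # (1, n_sents, hidden_dim)
        _, doc_h = self.doc_rnn(sent_seq)
        return doc_h.squeeze(0)                   # (1, hidden_dim) document context
```

The returned vector would then be fed into the translation model, for example as an extra input to the decoder's initial state.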
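The cache update rule of Tu et al. (2017) can be sketched in a few lines. In the real model the cache stores continuous key/value vectors derived from the NMT hidden states; the toy version below uses plain floats in their place, purely to illustrate the least-recently-used replacement and the averaging on a cache hit.

```python
# Toy sketch of the cache update rule: LRU replacement on insertion,
# averaging of the stored value on a hit. Floats stand in for the
# continuous vectors used in the actual model.
from collections import OrderedDict

class TranslationCache:
    def __init__(self, capacity=100):       # the paper explores sizes of ~25 to 500
        self.capacity = capacity
        self.store = OrderedDict()           # word -> value, least recently used first

    def update(self, word, value):
        if word in self.store:
            # Word already cached: average the new value with the stored one.
            self.store[word] = 0.5 * (self.store[word] + value)
            self.store.move_to_end(word)     # mark as most recently used
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict the least recently used word
            self.store[word] = value
```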
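Finally, a condensed sketch of the hierarchical attention computation behind HAN, assuming a single query vector for the word currently being translated: word-level attention summarises each previous sentence, and a second, sentence-level attention combines those summaries into one document-context vector. Dimensions and function names are illustrative assumptions, not the authors' code.

```python
# Sketch only: two levels of dot-product attention, first over words
# within each previous sentence, then over the sentence summaries.
import torch
import torch.nn.functional as F

def hierarchical_attention(query, prev_sent_states):
    # query:            (d,) state for the word being translated
    # prev_sent_states: list of (len_i, d) encoder states, one per previous sentence
    summaries = []
    for states in prev_sent_states:
        # Word-level attention within one previous sentence.
        scores = states @ query              # (len_i,)
        weights = F.softmax(scores, dim=0)
        summaries.append(weights @ states)   # (d,) sentence summary
    sent_mat = torch.stack(summaries)        # (n_sents, d)
    # Sentence-level attention over the sentence summaries.
    sent_weights = F.softmax(sent_mat @ query, dim=0)
    return sent_weights @ sent_mat           # (d,) document-context vector
```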
How did it perform?
Wang et al. (2017) obtained a +2 BLEU point improvement over the baseline RNN model. In a document-level evaluation, they reported fixing 29 of the 38 ambiguity errors and 24 of the 32 inconsistency errors present in a Chinese-English test set.
Compared to the baseline Transformer model, Miculicich et al. (2018) obtained improvements of around 0.3 to 1.0 BLEU points using the cache-based approach and around 0.9 to 1.4 BLEU points using HAN on various Chinese-English and Spanish-English test sets. They also evaluated noun translation accuracy, pronoun translation accuracy, lexical cohesion and coherence, and obtained improvements in all four measures using HAN.
In summary
All three approaches discussed here obtain better translations than their respective baseline networks without document information. This matches the intuition that it is good to include some document context during machine translation rather than throwing away the relevant information it carries.
In the past, with Statistical MT, document-level translation wasn't really computationally practical, but that has changed with Neural MT. For the time being, mainstream research will continue to focus primarily on sentence-level approaches because there is still room for improvement there. However, as sentence-level translation matures, document-level approaches will come more into focus to optimise the output for certain use cases and content types.