In this week’s post, we take a look at 'context-aware' machine translation. This topic deals with how Neural MT engines can make use of external information to determine what translation to produce, where "external information" means information other than the words in the sentence being translated. Other modalities, such as speech, images and videos, or even other sentences in the source document, may contain relevant information that could help to resolve ambiguities and improve the overall quality of the translation.

As such, one application of context-aware Neural MT is document-level translation, a topic which we covered in Issue #15. We saw approaches that take several previous sentences into account when translating the source sentence. This context was modelled in a cache or as an additional neural network: an encoder or an attention network. In the latter case, the attention allows the network to focus on different words and sentences as required, which is why it is called a hierarchical attention network (HAN). In this case, the document context is used in both the encoder and the decoder.

Some recent approaches

Recently, two new approaches have been proposed to introduce more context into the process. Jean and Cho (2019) modify the learning algorithm by adding a so-called regularisation term, which encourages the model to take useful additional context into account during training. Here, “useful” means that the model assigns a higher probability to the translation when relying on the additional context than when not doing so. The regularisation term works at three levels: word, sentence and entire document. This approach therefore relies not only on several previous sentences, but on the entire document context.
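To make the idea of such a regularisation term more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the hinge form, the margin value and the function names are all assumptions chosen for illustration. It only shows how a penalty could be raised whenever the model with context fails to beat the model without context, at the word, sentence and document levels.

```python
def hinge(gap, margin):
    """Penalty that is zero once the context-aware gain exceeds the margin."""
    return max(0.0, margin - gap)


def context_regulariser(logp_ctx, logp_no_ctx, margin=0.1):
    """Return (word, sentence, document) penalty terms.

    logp_ctx / logp_no_ctx: one list per sentence, holding the log-probability
    of each reference word with and without the additional context.
    The hinge form and the margin are assumptions made for this sketch.
    """
    word_term = 0.0
    sent_term = 0.0
    doc_gap = 0.0
    for with_ctx, without_ctx in zip(logp_ctx, logp_no_ctx):
        # Gain in log-probability obtained from the context, word by word.
        gaps = [a - b for a, b in zip(with_ctx, without_ctx)]
        word_term += sum(hinge(g, margin) for g in gaps)
        sent_gap = sum(gaps)
        sent_term += hinge(sent_gap, margin)
        doc_gap += sent_gap
    doc_term = hinge(doc_gap, margin)
    return word_term, sent_term, doc_term


# Toy document with two sentences (made-up log-probabilities).
with_context = [[-1.0, -2.0], [-0.5]]
without_context = [[-1.5, -2.0], [-0.4]]
terms = context_regulariser(with_context, without_context)
```

In a real system these terms would be weighted and added to the usual translation loss; the weights are left out here because the post does not describe them.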

Maruf et al. (2019) propose an extension of the HAN approach in which the whole document context is taken into account. To make this tractable, they use sparser attention weights; that is, the attention weights are non-zero for only a few sentences. Sparse attention thus allows the model to selectively focus on relevant sentences in the document context and then attend to key words in those sentences. In the end, the amount of context considered is similar to that in the original HAN approach, but this context can now be located anywhere in the document. The document-level context representation is integrated into the encoder or decoder of a Transformer neural network, depending on whether monolingual or bilingual context is used.
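The two-step selection described here (first sentences, then words within them) can be sketched as follows. This is a simplified illustration rather than the model from the paper: it assumes sparsemax as the function producing sparse sentence weights and a plain softmax at the word level, and the scores are simply passed in as lists of numbers instead of being computed by a network.

```python
import math


def sparsemax(scores):
    """Project scores onto the probability simplex; many outputs are exactly 0."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k largest scores stay in the support while this condition holds.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(0.0, s - tau) for s in scores]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_weights(sentence_scores, word_scores):
    """Combine sparse sentence-level weights with word-level weights.

    Returns {sentence index: [weight per word]} for selected sentences only.
    """
    sentence_weights = sparsemax(sentence_scores)
    weights = {}
    for i, w in enumerate(sentence_weights):
        if w > 0.0:  # sentences with zero weight are skipped entirely
            weights[i] = [w * p for p in softmax(word_scores[i])]
    return weights


# Three context sentences; the third receives a zero sentence-level weight.
selected = hierarchical_weights(
    [1.0, 0.8, -1.0],
    [[0.0, 0.0], [1.0, 1.0, 1.0], [5.0]],
)
```

Because unselected sentences get a weight of exactly zero, their words never need to be scored, which is what keeps attention over a whole document manageable in this sketch.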

How does it perform?

The approach of Maruf et al. was assessed using the BLEU score, a subjective evaluation conducted by native speakers, and a contrastive pronoun test set. The latter consists of assessing the accuracy of translating the English pronoun “it” into its German counterparts “es”, “er” and “sie”. The proposed approach does better than the baselines under all three evaluation methods.
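As a rough illustration of how accuracy on a contrastive pronoun test set can be computed, the sketch below assumes the usual contrastive setup: the model scores several candidate translations that differ only in the pronoun, and an example counts as correct when the candidate with the right pronoun gets the highest score. The data layout, the function names and the toy scorer are hypothetical; a real evaluation would plug in the MT model's own scoring function.

```python
def contrastive_accuracy(examples, score):
    """Share of examples where the gold pronoun's candidate scores highest.

    examples: list of dicts with "source", "candidates" (pronoun -> translation)
    and "gold" (the correct pronoun).
    score: function (source, translation) -> number, higher is better.
    """
    correct = 0
    for ex in examples:
        candidates = ex["candidates"]
        best = max(candidates, key=lambda p: score(ex["source"], candidates[p]))
        correct += best == ex["gold"]
    return correct / len(examples)


def toy_score(source, translation):
    """Stand-in scorer that ignores context and always prefers "Er"."""
    return 1.0 if translation.startswith("Er ") else 0.0


candidates = {"er": "Er ist laut.", "es": "Es ist laut.", "sie": "Sie ist laut."}
examples = [
    # The antecedent is "der Hund", so the correct pronoun is "er".
    {"source": "The dog barks. It is loud.", "candidates": candidates, "gold": "er"},
    # The antecedent is "die Katze", so the correct pronoun is "sie".
    {"source": "The cat meows. It is loud.", "candidates": candidates, "gold": "sie"},
]
accuracy = contrastive_accuracy(examples, toy_score)
```

A scorer that ignores the context, like the toy one here, can only get the pronoun right by chance, which is why this kind of test is useful for checking whether a model actually uses the document context.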

We can better visualise how it works with an example: translating the word “thoughts” in the sentence “my thoughts are also with the victims”. In this case, the most relevant context sentences (that is, the ones with the highest attention weights) focus on phrases (the words with the highest word-level attention) like “words of sympathy”, “support” and “symbol of hope”, which are related to “thoughts”. This allows the engine to choose the right German translation of “my thoughts” (“meine Gedanken”), while the baseline chooses “I think” (“ich denke”).

In summary

Context-aware Neural MT approaches have gone a step further and can now scale well to the entire document context. In many applications, a lot of progress is still to be made in sentence-level Neural MT, so taking more context into account can be considered a refinement. However, in some applications, the document or multi-modal context may have a more significant impact.