Issue #26 - Context and Copying in Neural MT

Raj Patel 21 Feb 2019
The topic of this blog post is model improvement.


When translating from one language to another, certain words and tokens need to be copied into the target sentence rather than translated. These include proper nouns, names, numbers, and 'unknown' tokens, which we want to appear in the translation exactly as they were in the original text. Neural MT systems with a subword vocabulary are capable of copying or translating these (unknown) tokens. Studies suggest that a neural model can also learn to copy a "copy-prone" source token even though it has learned to translate it. In this post we will try to understand the copying behaviour of Neural MT and see whether context plays any role in deciding whether to copy or translate.

Copying in Neural MT

In Neural MT, copying is more of a challenge than in Statistical MT because of the subword vocabulary and the use of soft attention rather than strict alignment. In the NMT literature, copying is generally handled by post-processing and/or by modifying the network architecture.

Neural MT models that use a subword vocabulary to perform open-vocabulary translation often translate or copy tokens even when the full word was not seen during training. Koehn and Knowles (2017) reported that Neural MT with a subword vocabulary outperforms SMT in translating unknown tokens or copy words. This raises the question: to what extent does byte-pair encoding (a type of subword vocabulary) help Neural MT in copying, without post-processing and/or modifying the network architecture? To answer this question, Knowles and Koehn (2018) used quantitative and qualitative evaluations of neural models with a subword vocabulary, shedding some light on what these models learn about copying tokens and about the contexts in which copying occurs.
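To make the subword point concrete, here is a minimal sketch of how byte-pair encoding segments a word into smaller units. The merge list is a toy assumption invented for illustration (it is not learned from any real corpus), the function name apply_bpe is hypothetical, and the sketch omits details of real BPE implementations such as end-of-word markers.

```python
def apply_bpe(word, merges):
    """Segment a word by repeatedly applying the highest-priority merge rule."""
    symbols = list(word)  # start from single characters
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge rules in priority order (an assumption for this example only).
toy_merges = [("t", "o"), ("A", "s"), ("As", "h"), ("to", "n")]

pieces = apply_bpe("Ashton", toy_merges)
print(pieces)           # ['Ash', 'ton']
print("".join(pieces))  # Ashton
```

Because the pieces concatenate back to the original string, a model that reproduces the source subwords in order has effectively copied the word, even if it never saw the full word during training.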

What words are copied?

To understand copying in Neural MT, Knowles and Koehn (2018) analyzed the training data, including the back-translated text. Of particular note, back-translated data contains some examples of copying that we would prefer the system not to learn, such as target-language text appearing untranslated on the source side. They considered a word a candidate for copying if it appears the same number of times in both the source and the target sentence. The search for copy words was restricted to tokens of 3 or more characters. They then used part-of-speech (POS) tags to find the categories of the copied tokens. In the analysis of the English-German (EN-DE) training data, most of the copied words were tagged as NNP (proper noun, singular), followed by CD (cardinal number), which includes numbers like 42 that should be copied and numbers like ‘seven’ that should be translated. The next POS category in the list was NN (noun, singular or mass). The results were similar for the DE-EN training data (with the German side tagged).
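The copy-candidate criterion described above (same count on both sides, 3 or more characters) can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name copy_candidates, the whitespace tokenization, and the example sentence pair are all assumptions.

```python
from collections import Counter

MIN_CHARS = 3  # the search is restricted to tokens of 3 or more characters


def copy_candidates(source_tokens, target_tokens, min_chars=MIN_CHARS):
    """Return tokens that occur equally often in the source and target sentence."""
    source_counts = Counter(source_tokens)
    target_counts = Counter(target_tokens)
    return {
        token
        for token, count in source_counts.items()
        if len(token) >= min_chars and target_counts[token] == count
    }


# Hypothetical EN-DE sentence pair, tokenized on whitespace.
source = "Mrs Ashton visited Berlin in 2018 .".split()
target = "Frau Ashton besuchte 2018 Berlin .".split()
print(sorted(copy_candidates(source, target)))  # ['2018', 'Ashton', 'Berlin']
```

Note that a count-based criterion like this ignores word order, so "2018" and "Berlin" still qualify even though they swap positions in the target.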

Context and Copying

In most languages, certain contexts indicate that copying should occur: for example, a name preceded by a title such as “Mr” or “Mrs”. Knowles and Koehn (2018) examined the relationship between context and copying, focusing on the left bigram context. They showed that a neural model learns that certain contexts are so indicative of copying that it will copy (not translate) words that it has learned to translate if they occur a significant number of times in a copy-prone context.

To analyze copying, they collected a set of left bigram contexts for each POS tag. Each context-POS pair is associated with a percentage that represents how often it exhibited copying in the training data. For example, in the EN-DE training data, in the copy-prone context “thank Mrs [NNP]”, the NNP was copied 91.1% of the time, compared to 15.3% of the time in “Republic of [NNP]”. To evaluate copying ability, they divided the context-POS pairs (from the test set) into four categories based on two binary distinctions: observed (in the training data) or novel (not observed in the training data), and copy (typically copied) or non-copy (typically not copied). Words are considered non-copy if they were copied ≤ 30% of the time, and copy if they were copied ≥ 70% of the time, in the training data (after filtering out those that appeared fewer than 1000 times). They then combined each word with a POS-appropriate example template, as follows (for the NNP tag):

Source: Therefore, Mrs Ashton, your role in this is invaluable.
Tagged: Therefore, Mrs [NNP], your role in this is invaluable.

Let's say the words BBC, June, and Lutreo were copied 80%, 0.8%, and 0.0% of the time, respectively, in the training data. With these words, the following samples would be created:

Example 1: Therefore, Mrs BBC, your role in this is invaluable.
Example 2: Therefore, Mrs June, your role in this is invaluable.
Example 3: Therefore, Mrs Lutreo, your role in this is invaluable.
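Generating such samples amounts to filling the tagged slot of a template with each candidate word. A minimal sketch, assuming plain string substitution on the POS placeholder (the function name fill_template is hypothetical and not taken from the paper):

```python
TEMPLATE = "Therefore, Mrs [NNP], your role in this is invaluable."


def fill_template(template, words, slot="[NNP]"):
    """Create one test sentence per word by substituting it into the tagged slot."""
    return [template.replace(slot, word) for word in words]


for number, sentence in enumerate(fill_template(TEMPLATE, ["BBC", "June", "Lutreo"]), start=1):
    print(f"Example {number}: {sentence}")
```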

These new examples are added to the test set, which is then preprocessed (including BPE segmentation) and translated.

In their experiments, they showed that the most copying occurs in copy-prone contexts. The most interesting case, however, is the observed-non-copy category: as these words are placed in increasingly copy-prone contexts, even words that the system has learned it should translate end up being copied.
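The copy and non-copy labels behind these categories rest on the simple thresholds quoted earlier (≥ 70%, ≤ 30%, and at least 1000 occurrences). A minimal sketch of that labelling step, where the function name copy_label is hypothetical and the raw counts are made up purely to reproduce the 91.1% and 15.3% rates mentioned above:

```python
COPY_THRESHOLD = 0.70      # copied at least 70% of the time -> "copy"
NON_COPY_THRESHOLD = 0.30  # copied at most 30% of the time -> "non-copy"
MIN_OCCURRENCES = 1000     # items seen fewer times than this are filtered out


def copy_label(times_copied, times_seen):
    """Label an observed item as 'copy' or 'non-copy'; return None if it is excluded."""
    if times_seen < MIN_OCCURRENCES:
        return None  # too rare in the training data
    rate = times_copied / times_seen
    if rate >= COPY_THRESHOLD:
        return "copy"
    if rate <= NON_COPY_THRESHOLD:
        return "non-copy"
    return None  # between the thresholds: neither clearly copy nor non-copy


# Illustrative counts only, chosen to match the percentages quoted in the post.
print(copy_label(911, 1000))  # copy      (91.1%, like "thank Mrs [NNP]")
print(copy_label(153, 1000))  # non-copy  (15.3%, like "Republic of [NNP]")
print(copy_label(500, 1000))  # None      (50% falls between the thresholds)
```

Leaving a gap between the two thresholds keeps ambiguous items out of both groups, so the copy and non-copy sets are clearly separated.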

In summary

In general, we can say that neural models learn about copying from context (among other factors), and that the effect of context is strong enough to cause words that would otherwise be translated to be copied. On the surface, this is a small piece of the puzzle when it comes to understanding Neural MT. However, correct handling, translation, and preservation of specific terms and tokens is critical for many commercial applications, making this topic a lot more important than it may first seem!
Raj Patel, Machine Translation Scientist