Issue #79 - Merging Terminology into Neural Machine Translation

Merging approaches
The paper by Wang et al. (2019) proposes three strategies to merge terminology into the training data:
- Tagging (T). In the training data, both the source-side terms and their target equivalents are surrounded by two markers, such as <start> and <end>. For example, a sentence pair including the terminology pair (香港, hong kong) would be written as follows:
“我 爱 <start> 香港 <end> ||| i love <start> hong kong <end>”
- Mixed phrase (M). As suggested by Dinu et al. (2019), the target term phrase is appended to the source term phrase, as follows:
“我 爱 香港 hong kong ||| i love hong kong”
- Tagging+Mixed phrase (T+M). The mixed phrase strategy can be combined with tagging by adding a tag between the source and appended target term phrases:
“我 爱 <start> 香港 <middle> hong kong <end> ||| i love <start> hong kong <end>”
- Extra embeddings (E). This approach does not enrich the training corpus. Instead, it builds each input token representation by adding an extra embedding to the word embedding and the positional embedding. This extra embedding distinguishes the source side of a terminology pair, its target side, and all other tokens. Combined with the mixed phrase approach above (M+E), the label sequence supporting the embedding would be “n n s t t” (s, t and n mark tokens of the source term phrase, tokens of the target term phrase, and other tokens, respectively). This approach can also be combined with the tagging+mixed phrase strategy (T+M+E), in which case the label sequence would be “n n n s n t t n”.
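The rewriting rules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' preprocessing code: sentences are whitespace-tokenized, each sentence contains at most one term pair, and the names `find_span` and `merge_term` are hypothetical. The function also returns the n/s/t label sequence that an extra-embedding variant would consume.

```python
from typing import List, Tuple

START, MIDDLE, END = "<start>", "<middle>", "<end>"


def find_span(tokens: List[str], phrase: List[str]) -> int:
    """Return the index of the first occurrence of phrase in tokens, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1


def merge_term(src: str, tgt: str, src_term: str, tgt_term: str,
               strategy: str) -> Tuple[str, str, str]:
    """Apply strategy "T", "M" or "T+M" to one sentence pair (hypothetical helper).

    Returns (source, target, labels); labels marks every source token as
    n (other token), s (source term token) or t (target term token).
    """
    if strategy not in ("T", "M", "T+M"):
        raise ValueError(f"unknown strategy: {strategy}")
    s_tok, t_tok = src.split(), tgt.split()
    s_term, t_term = src_term.split(), tgt_term.split()
    i, j = find_span(s_tok, s_term), find_span(t_tok, t_term)
    if i < 0 or j < 0:
        # Term pair not present on both sides: leave the pair untouched.
        return src, tgt, " ".join(["n"] * len(s_tok))

    tagged = strategy in ("T", "T+M")
    mixed = strategy in ("M", "T+M")

    # Build the rewritten source-side term span as (token, label) pairs.
    span = []
    if tagged:
        span.append((START, "n"))
    span += [(w, "s") for w in s_term]
    if tagged and mixed:
        span.append((MIDDLE, "n"))
    if mixed:
        span += [(w, "t") for w in t_term]
    if tagged:
        span.append((END, "n"))

    before = [(w, "n") for w in s_tok[:i]]
    after = [(w, "n") for w in s_tok[i + len(s_term):]]
    pairs = before + span + after
    new_src = " ".join(w for w, _ in pairs)
    labels = " ".join(label for _, label in pairs)

    # Only the tagging strategies modify the target side.
    if tagged:
        t_tok = t_tok[:j] + [START] + t_term + [END] + t_tok[j + len(t_term):]
    return new_src, " ".join(t_tok), labels
```

Calling `merge_term("我 爱 香港", "i love hong kong", "香港", "hong kong", "T+M")` reproduces the T+M example from the list, together with the label sequence “n n n s n t t n”. The same function could be applied to the source side of a test set, with the target argument ignored.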
Experiments and Results
The experiments were conducted with a Transformer model on two tasks: Chinese-English, with a news-domain corpus of 1.25 million sentence pairs, and English-German, with a corpus of 4.5 million sentence pairs in the same domain. The external bilingual pairs were named entities extracted from the training corpus with a named entity recognition system. With the T, T+M and T+M+E strategies, the test set has to be tagged in the same way as the source side of the training corpus.
On the Chinese-English task, the baseline Transformer already translates the named entities as in the extracted terminology database in 74% of sentences. The T+M strategy raises this figure to 98.37%, the M+E strategy to 94.10%, and the T+M+E strategy to 98.40%. Thus the extra embeddings do not bring noticeable improvements over the T+M strategy, but the M+E strategy is a softer approach which does not require modifying the training data and significantly improves on the baseline. On the other hand, the advantage of the T+M strategy is that it does not require modifying the Transformer model. On the English-German task, the baseline already translates named entities as in the extracted glossary in 91.2% of sentences. This figure increases to 99.3% with the T+M+E strategy.
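To make the reported percentages concrete, here is a minimal sketch of how such a sentence-level figure could be computed. It assumes that a sentence counts as correct when every expected target term appears in the output as an exact, case-insensitive token sequence; the matching criterion actually used in the paper may differ, and the names `contains_phrase` and `term_match_rate` are hypothetical.

```python
from typing import List, Sequence


def contains_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """True if phrase occurs in tokens as a contiguous token sequence."""
    n = len(phrase)
    if n == 0:
        return False
    return any(list(tokens[i:i + n]) == list(phrase)
               for i in range(len(tokens) - n + 1))


def term_match_rate(hypotheses: List[str],
                    expected_terms: List[List[str]]) -> float:
    """Percentage of sentences whose output contains all expected target terms.

    hypotheses[k] is the MT output for sentence k, and expected_terms[k]
    lists the glossary translations expected in that output.
    """
    if len(hypotheses) != len(expected_terms):
        raise ValueError("one list of expected terms is needed per hypothesis")
    if not hypotheses:
        return 0.0
    hits = 0
    for hyp, terms in zip(hypotheses, expected_terms):
        tokens = hyp.lower().split()
        if all(contains_phrase(tokens, term.lower().split()) for term in terms):
            hits += 1
    return 100.0 * hits / len(hypotheses)
```

For instance, with the outputs `["i love hong kong", "i love xianggang"]` and the expected term `"hong kong"` for both sentences, only the first sentence counts as a match.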
Analysis of Embedding and Attention
By extracting the embeddings of the T+M+E model and computing their nearest neighbors (by cosine similarity), the authors observe that, when the M strategy is used, the shared word embeddings gradually become cross-lingual word embeddings. For example, the nearest neighbor of “india” is “印度”, its equivalent in Chinese.
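A nearest-neighbor lookup of this kind can be sketched as follows. The vectors below are invented toy values chosen only to make the example run; they are not taken from the paper, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor(word: str,
                     embeddings: Dict[str, List[float]]) -> Optional[str]:
    """Return the vocabulary entry most similar to word, excluding word itself."""
    query = embeddings[word]
    best, best_sim = None, -2.0  # cosine similarity is never below -1
    for other, vec in embeddings.items():
        if other == word:
            continue
        sim = cosine(query, vec)
        if sim > best_sim:
            best, best_sim = other, sim
    return best


# Toy shared vocabulary with made-up 3-dimensional embeddings (assumption).
toy_embeddings = {
    "india": [0.9, 0.1, 0.0],
    "印度": [0.8, 0.2, 0.0],
    "love": [0.0, 0.1, 0.9],
    "爱": [0.1, 0.0, 0.8],
}
```

With these toy vectors, `nearest_neighbor("india", toy_embeddings)` returns `"印度"`, mirroring the kind of cross-lingual neighbor described above.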
Looking at the attention in different heads, the authors also observe strong connections between the same tag in different heads. This type of attention pattern is expected to help in translating the terms enclosed between the tags. The example attention matrix shown in the paper supports this: the target-side phrase is connected to the corresponding source phrase as well as to the to-copy phrase. This illustrates the role of the mixed phrase strategy.