Issue #79 - Merging Terminology into Neural Machine Translation

The topic of this blog post is terminology.
After several years as the state of the art in machine translation, neural MT still lacks a convenient way to enforce the translation of custom terms according to a glossary. In issue #7, we reviewed several approaches to handling terminology in neural MT. Simply adding the glossary to the training data is not effective. Replacing the source term with a placeholder, and then replacing the placeholder with the glossary translation, is easy to implement. However, it is a hard decision: the model cannot discard the glossary translation in contexts where it does not fit, nor can it fully decide where the term translation goes in the target sentence. Adding constraints at decoding time significantly reduces translation speed, which makes it unsuitable for production engines. In this post we take a look at a paper that considers strategies which help enforce custom terminology translation without forcing the model to make any hard decisions, and with no impact on translation speed.

Merging approaches

The paper by Wang et al. (2019) proposes three strategies for merging terminology into neural MT (tagging, mixed phrase and extra embeddings), which can also be combined:

  • Tagging (T). In the training data, both the source-side terms and their target equivalents are enclosed in a pair of markers, such as <start> and <end>. For example, a sentence pair including the terminology pair (香港, hong kong) would be written as follows.
    “我 爱 <start> 香港 <end> ||| i love <start> hong kong <end>”
  • Mixed phrase (M). As suggested by Dinu et al. (2019), the target term phrase is appended to the source term phrase, as follows.
    “我 爱 香港 hong kong ||| i love hong kong”
  • Tagging+Mixed phrase (T+M). The mixed phrase strategy can be combined with tagging by adding a tag between the source and appended target term phrases:
    “我 爱 <start> 香港 <middle> hong kong <end> ||| i love <start> hong kong <end>”
  • Extra embeddings (E). This approach does not involve enriching the training corpus. Instead, the input representation of each token is built by adding an extra embedding to its word embedding and positional embedding. This extra embedding distinguishes the source side of a terminology pair, its target side, and all other tokens. Combined with the mixed phrase approach above (M+E), the extra embeddings follow the label sequence “n n s t” (where s, t and n refer to source term phrases, target term phrases and other tokens, respectively). This approach can also be combined with the tagging+mixed phrase strategy (T+M+E), in which case the label sequence would be “n n n s n t t n”.
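
The corpus formats listed above can be reproduced with a few lines of Python. This is only a minimal sketch, not the authors' preprocessing code: the function names (find_span, wrap, build_example, type_labels) are hypothetical, sentences are assumed to be whitespace-tokenized, and only the first occurrence of a single terminology pair is handled.

```python
START, MIDDLE, END = "<start>", "<middle>", "<end>"


def find_span(tokens, phrase):
    """Return the start index of the first occurrence of phrase in tokens, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1


def wrap(tokens, phrase, before, after):
    """Return tokens with the first occurrence of phrase surrounded by before/after."""
    i = find_span(tokens, phrase)
    if i < 0:
        return list(tokens)  # term not found: leave the sentence untouched
    return tokens[:i] + before + phrase + after + tokens[i + len(phrase):]


def build_example(src, tgt, src_term, tgt_term, strategy):
    """Format one sentence pair for strategy 'T', 'M' or 'T+M' (assumed names)."""
    s, t = src.split(), tgt.split()
    st, tt = src_term.split(), tgt_term.split()
    if strategy == "T":
        s = wrap(s, st, [START], [END])
        t = wrap(t, tt, [START], [END])
    elif strategy == "M":
        s = wrap(s, st, [], tt)  # append the target term to the source term
    elif strategy == "T+M":
        s = wrap(s, st, [START], [MIDDLE] + tt + [END])
        t = wrap(t, tt, [START], [END])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return " ".join(s) + " ||| " + " ".join(t)


def type_labels(tagged_src):
    """Label each token of a T+M source sentence as s, t or n (tags count as n)."""
    labels, state = [], "n"
    for tok in tagged_src.split():
        if tok == START:
            labels.append("n")
            state = "s"
        elif tok == MIDDLE:
            labels.append("n")
            state = "t"
        elif tok == END:
            labels.append("n")
            state = "n"
        else:
            labels.append(state)
    return " ".join(labels)


for strategy in ("T", "M", "T+M"):
    print(build_example("我 爱 香港", "i love hong kong", "香港", "hong kong", strategy))
print(type_labels("我 爱 <start> 香港 <middle> hong kong <end>"))
```

In a T+M+E setup, each label from type_labels would index a small embedding table whose vectors are added to the word and positional embeddings; that lookup is left out here because it depends on the model implementation.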

Experiments and Results

The experiments were conducted with a Transformer model on two tasks: a Chinese-English news-domain corpus of 1.25 million sentence pairs, and an English-German corpus of 4.5 million sentence pairs in the same domain. The external bilingual pairs were named entities extracted from the training corpus with a named entity recognition system. In the T, T+M and T+M+E strategies, the test set has to be tagged in the same way as the source side of the training corpus.
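
At translation time, tagging a test sentence in the T+M format could look like the sketch below. The function name annotate_source, the dictionary-shaped glossary and the exact-token, longest-match-first lookup are assumptions made for illustration; the post does not describe how the authors located entities in the test set.

```python
def annotate_source(sentence, glossary):
    """Apply T+M markup to every glossary term found in a tokenized source sentence.

    glossary maps a source term to its target term, both whitespace-tokenized.
    """
    tokens = sentence.split()
    # Try longer source terms first so a short term cannot pre-empt a longer one.
    entries = sorted(
        ((src.split(), tgt.split()) for src, tgt in glossary.items() if src.split()),
        key=lambda entry: len(entry[0]),
        reverse=True,
    )
    out, i = [], 0
    while i < len(tokens):
        for src_term, tgt_term in entries:
            if tokens[i:i + len(src_term)] == src_term:
                out += ["<start>"] + src_term + ["<middle>"] + tgt_term + ["<end>"]
                i += len(src_term)
                break
        else:
            # No glossary term starts at this position: copy the token as-is.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(annotate_source("我 爱 香港", {"香港": "hong kong"}))
```

Sentences that contain no glossary term pass through unchanged, so the same function can be applied to the whole test set.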

On the Chinese-English task, the baseline Transformer already translates the named entities as in the extracted terminology database in 74% of sentences. With the T+M strategy, this figure rises to 98.37%; with the M+E strategy, to 94.10%; and with the T+M+E strategy, to 98.40%. Thus the extra embeddings bring no noticeable improvement over the T+M strategy, but the M+E strategy is a softer approach which does not require modifying the training data and significantly improves on the baseline. On the other hand, the advantage of the T+M strategy is that it does not require modifying the Transformer model. On the English-German task, the baseline already translates named entities as in the extracted glossary in 91.2% of sentences. This figure rises to 99.3% with the T+M+E strategy.
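
A sentence-level percentage of this kind can be computed along the following lines. This is a sketch under stated assumptions, not the paper's evaluation script: the name term_match_rate is hypothetical, a term counts as correctly translated when its tokens appear contiguously in the output, and sentences without any glossary term are left out of the denominator.

```python
def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous token sequence in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def term_match_rate(hypotheses, expected_terms):
    """Percentage of sentences whose output contains every expected target term.

    hypotheses: list of whitespace-tokenized MT outputs.
    expected_terms: for each sentence, the list of glossary target terms.
    """
    if len(hypotheses) != len(expected_terms):
        raise ValueError("hypotheses and expected_terms must have the same length")
    scored = hits = 0
    for hyp, terms in zip(hypotheses, expected_terms):
        if not terms:
            continue  # assumption: sentences without glossary terms are not counted
        scored += 1
        tokens = hyp.split()
        if all(contains_phrase(tokens, term.split()) for term in terms):
            hits += 1
    return 100.0 * hits / scored if scored else 0.0


outputs = ["i love hong kong", "i love xianggang"]
terms = [["hong kong"], ["hong kong"]]
print(term_match_rate(outputs, terms))
```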

Analysis of Embedding and Attention

By extracting the embeddings of the T+M+E model and calculating their nearest neighbors (using cosine similarity), the authors observe that when using the M strategy, the shared word embeddings gradually become cross-lingual word embeddings. For example, the nearest neighbor of “india” is “印度”, its equivalent in Chinese.
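
A nearest-neighbor lookup of this kind can be sketched in plain Python as follows. The function names and the three-dimensional vectors are made up for illustration; they are toy values, not embeddings taken from the paper's model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, embeddings, k=1):
    """Return the k words whose vectors are most similar to that of word."""
    query = embeddings[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


# Toy shared vocabulary with invented vectors.
toy_embeddings = {
    "india": [0.9, 0.1, 0.0],
    "印度": [0.8, 0.2, 0.1],
    "love": [0.0, 0.1, 0.9],
}
print(nearest_neighbors("india", toy_embeddings))
```

With a real model, the vectors would come from the shared embedding matrix, and a cross-lingual neighbor such as “印度” for “india” would indicate that the two languages have been mapped into a common space.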

Looking at the attention across heads, the authors also note strong connections between matching tags in different heads. This type of attention pattern is expected to help in the translation of the terms enclosed between the tags. This is what happens in the example attention matrix shown in the paper, in which the target-side phrase is connected to the corresponding source phrase as well as to the to-copy phrase (the target term appended to the source sentence). This illustrates the role of the mixed phrase strategy.

In summary

This paper proposes effective strategies to enforce the translation of terms according to a glossary, without forcing the model to make hard decisions, and without reducing the translation speed. Two types of approach are considered, which can be used in isolation or combined. The first type only requires tagging the training data and the test sentences, and the second type only requires slightly modifying the Transformer model. This paper thus seems to be a nice step towards a convenient way of handling terminology in Neural MT.
Dr. Patrik Lambert

Senior Machine Translation Scientist
Patrik conducts research on and builds high-quality customized machine translation engines, proposes and develops improved approaches to the company's machine translation software, and provides support to other team members.
He received a master's degree in Physics from McGill University. He then worked for several years as a technical translator and software developer. In 2008 he completed a PhD in Artificial Intelligence at the Polytechnic University of Catalonia (UPC, Spain). He then worked as a research associate on machine translation and cross-lingual sentiment analysis.