What are the different approaches to Neural MT? Since its relatively recent advent, the underlying technology has been based on one of three main architectures:
- Recurrent Neural Networks (RNN)
- Convolutional Neural Networks (CNN)
- Self-Attention Networks (Transformer)
For various language pairs, non-recurrent architectures (CNN and Transformer) have outperformed RNNs, but there has been no solid explanation as to why. In this post, we’ll evaluate these architectures for their ability to model complex linguistic phenomena, such as long-range dependencies, and try to understand if and how this contributes to better performance.
Contrastive Evaluation of Machine Translation
BLEU is used as a standard metric to evaluate the quality of the translation (we won't open that Pandora's box today!), but it can't explicitly evaluate the translation with respect to a specific linguistic phenomenon, e.g. subject-verb agreement or word sense disambiguation (both happen implicitly during machine translation). In the literature, contrastive translations are the most common approach used to measure the accuracy of a model with respect to various linguistic phenomena.
Contrastive translations are created by introducing a specific type of error/noise into a human translation. An example of a contrastive pair in the subject-verb agreement category:
English: [...] plan will be approved.
German: [...] Plan verabschiedet wird.
Contrast: [...] Plan verabschiedet werden.
Here, the word “wird” is replaced with “werden” to generate the contrastive translation. To create contrastive translations for word sense disambiguation, given an ambiguous word in the source sentence, its correct translation is replaced with a translation corresponding to one of the word's other, incorrect senses.
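As a minimal sketch, a subject-verb agreement contrastive variant like the one above can be generated by swapping a correct word form for an agreement-violating one. The function name is illustrative, not from any particular toolkit, and in practice the form pairs would come from a morphological lexicon:

```python
def make_contrastive(reference, correct_form, wrong_form):
    """Create a contrastive variant of a reference translation by
    replacing one correct word form with an incorrect one
    (e.g. a number-agreement violation).

    The (correct_form, wrong_form) pair is assumed to come from a
    morphological lexicon; here it is supplied by hand."""
    return [wrong_form if tok == correct_form else tok for tok in reference]

# The subject-verb agreement example from above:
reference = ["Plan", "verabschiedet", "wird"]
contrast = make_contrastive(reference, "wird", "werden")
# contrast == ["Plan", "verabschiedet", "werden"]
```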
Contrastive translation evaluation exploits the fact that Neural MT models are conditional language models and can produce a probability p(T|S), where S and T are the source and target sentences respectively. If a model assigns a higher probability to the correct target sentence than to a contrastive variant that contains an error, we consider it a correct decision. The accuracy of the model in this test scenario is simply the percentage of cases where the correct target sentence is scored higher than all contrastive variants.
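This scoring procedure can be sketched in a few lines of Python. The model interface below (a function returning per-token log-probabilities) is a hypothetical stand-in, not the API of any real NMT toolkit:

```python
import math

def sentence_logprob(model, source, target):
    # log p(T|S): sum the model's log-probability for each target token
    # given the source and the target prefix (hypothetical interface)
    return sum(model(source, target[:i], tok) for i, tok in enumerate(target))

def contrastive_accuracy(model, examples):
    """examples: (source, reference, [contrastive variants]) triples.
    A decision counts as correct when the reference outscores every variant."""
    correct = 0
    for source, reference, variants in examples:
        ref = sentence_logprob(model, source, reference)
        if all(sentence_logprob(model, source, v) < ref for v in variants):
            correct += 1
    return correct / len(examples)

# Toy stand-in model that simply dislikes the plural form "werden":
def toy_model(source, prefix, token):
    return math.log(0.1) if token == "werden" else math.log(0.9)

examples = [(
    ["plan", "will", "be", "approved"],
    ["Plan", "verabschiedet", "wird"],
    [["Plan", "verabschiedet", "werden"]],
)]
print(contrastive_accuracy(toy_model, examples))  # 1.0
```

Note that no decoding happens here: the model only scores pre-built sentences, which is what makes the evaluation cheap and targeted.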
CNN and Transformer models can connect distant words via shorter network paths than RNNs can. It has been speculated that this improves their ability to model long-range dependencies, resulting in better translation systems. However, this theoretical argument had not been tested empirically. Subject-verb agreement is the most popular choice for evaluating the ability to model long-range dependencies.
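To make the "shorter paths" argument concrete, here is a schematic comparison of how many network steps connect two token positions in each architecture. These formulas are a textbook simplification (assuming non-dilated convolutions with kernel size k), not measurements from any specific model:

```python
import math

def rnn_path_length(i, j):
    # an RNN relates positions i and j through |j - i| recurrent steps
    return abs(j - i)

def cnn_path_length(i, j, k=3):
    # stacked (non-dilated) convolutions with kernel size k widen the
    # receptive field by k - 1 positions per layer
    return math.ceil(abs(j - i) / (k - 1))

def transformer_path_length(i, j):
    # self-attention connects any two positions within a single layer
    return 1

# A subject and verb 20 tokens apart:
print(rnn_path_length(0, 20), cnn_path_length(0, 20), transformer_path_length(0, 20))
# 20 10 1
```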
Sennrich (2017) evaluated the RNN model for its ability to learn long-range dependencies. He compared a character-based model with one based on subword units and reported better morphosyntactic agreement with the latter. Tran et al. (2018) compared RNN and Transformer models with respect to their ability to model hierarchical (syntactic) structure and found that recurrence is indeed important for this purpose. Tang et al. (2018) did an empirical analysis of RNN, CNN, and Transformer models, comparing their performance on subject-verb agreement, and concluded that Transformer and CNN models do not outperform RNNs in modeling long-range dependencies. They also showed that the number of attention heads in the Transformer model impacts its ability to capture long-range dependencies.
Word Sense Disambiguation (WSD)
According to Tang et al. (2018), CNN and Transformer models are not better than RNNs at capturing long-range dependencies, even though the paths through CNN and Transformer networks are shorter. However, these architectures perform well empirically according to BLEU. Thus, to test their hypothesis that non-recurrent architectures are better at extracting semantic features, resulting in better translation quality, they further evaluated these architectures on WSD, a task where semantic feature extraction is required. In the paper, they report that the Transformer model strongly outperforms the other architectures (CNN and RNN) on the WSD task, with a gap of 4-8 percentage points. This affirms their hypothesis that Transformers are strong semantic feature extractors, resulting in better translation systems.
According to these recent reports, there is no evidence that CNN and Transformer models, which have shorter paths through their networks, are empirically superior to RNNs in modeling subject-verb agreement over long distances. However, Transformer models outperform CNN and RNN architectures at WSD, affirming that they are strong semantic feature extractors, which implicitly contributes to their strong performance.
The implications of this are important. It harks back to the question of whether it's possible to have a one-size-fits-all approach to machine translation (editor: it's not!). These findings suggest that there may be a case for applying different Neural MT architectures to different challenges, be it language, content, or style.