Word alignments were the cornerstone of all previous approaches to statistical MT: you take your parallel corpus, align the words, and build from there. In Neural MT, however, word alignment is no longer needed as an input to the system. That said, research is coming back around to the idea that it remains useful in practical scenarios, for tasks such as replacing tags in MT output.
Conveniently, current Neural MT engines can extract word alignment from the attention weights. Unfortunately, its quality is worse than that of external word alignments produced by traditional SMT approaches: attending to context words rather than to the aligned source word can be useful for translation, so the attention weights do not necessarily peak on the aligned word. In this post, we take a look at two papers proposing methods that improve the word alignment extracted from Transformer models in Neural MT.
Both papers report results with the Alignment Error Rate (AER) metric, which has been shown to be an inadequate measure of alignment quality (see Fraser and Marcu, 2007 or Lambert et al., 2005). Their analysis and ideas are still very interesting, and the second paper also reports precision and recall for some of the experiments.
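To make the attention-based extraction concrete, here is a minimal sketch in plain Python. It assumes the simplest possible rule, linking each target word to the source word with the highest attention weight; the function name and the toy attention matrix are illustrative and not taken from either paper.

```python
def alignment_from_attention(attention):
    """Link each target word to its most-attended source word.

    attention[i][j] is the weight that target word i puts on source word j.
    Returns a set of (source_index, target_index) links.
    """
    links = set()
    for tgt_idx, row in enumerate(attention):
        # argmax over source positions for this target word
        src_idx = max(range(len(row)), key=row.__getitem__)
        links.add((src_idx, tgt_idx))
    return links


# Toy attention matrix: 3 target words (rows) x 3 source words (columns).
attention = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # this row peaks on source word 0, possibly a context word
]
links = alignment_from_attention(attention)
```

With this rule every target word gets exactly one link, so a target word that mostly attends to context ends up with a spurious link, which is one way the quality problem described above shows up.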
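For reference, the sketch below computes precision, recall, and AER under the usual definition with "sure" and "possible" gold links. The post itself does not spell the formulas out, so treat this as an assumed standard formulation; the function name and the toy link sets are made up for illustration.

```python
def alignment_scores(hypothesis, sure, possible):
    """Precision, recall and AER for a set of hypothesised alignment links.

    All arguments are sets of (source_index, target_index) pairs.
    Sure links are also counted as possible links.
    """
    possible = possible | sure
    hits_sure = len(hypothesis & sure)
    hits_possible = len(hypothesis & possible)
    precision = hits_possible / len(hypothesis)
    recall = hits_sure / len(sure)
    aer = 1.0 - (hits_sure + hits_possible) / (len(hypothesis) + len(sure))
    return precision, recall, aer


# Toy gold standard and system output.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
hypothesis = {(0, 0), (2, 1), (2, 2)}
precision, recall, aer = alignment_scores(hypothesis, sure, possible)
```

In this toy case precision is 2/3, recall is 1/2, and AER is 0.4. Reporting precision and recall next to AER, as the second paper does for some experiments, shows which kind of error dominates.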
On the Word Alignment from Neural Machine Translation (Li et al., 2019)
Li et al. propose two methods to induce word alignment in a neural MT engine. The first one consists of introducing an explicit word alignment model. The disadvantage of this method is that it requires previously aligned data to supervise training.
The second method induces word alignment by prediction difference. The intuition is that if a source word x is aligned to a target word y, x is much more relevant to y than to the other target words. Thus the probability of predicting a target word changes more for y than for the other target words depending on whether x is present in the source or not. This difference is calculated by enabling or disabling the connection between x and the encoder network. The idea is inspired by dropout, in which a percentage of words are randomly disconnected from the network to limit over-fitting.
With both methods, word alignment is clearly better than the one extracted from attention weights. However, both are worse than the statistical word alignment inferred by FAST ALIGN. To explain this last result, Li et al. distinguish between target words mostly contributed from the source (CFS, such as content words) and target words mostly contributed from the target (CFT, such as function words, which may not be aligned to any source word and depend more on neighbouring target words). They find that neural MT captures alignment better for CFS words than for CFT words, and that FAST ALIGN generates much better alignment than NMT for CFT words. The poor performance of NMT alignment on CFT words ties in with another conclusion of the paper: translation errors are mainly caused by alignment errors on CFS words, not on CFT words. Thus a good alignment of CFT words by the NMT system is not critical.
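The prediction-difference idea can be sketched on a toy example. The code below is not the authors' implementation: `toy_prob` is a hypothetical stand-in for the NMT model's probability of a target word given the source, and disabling a source word is simulated by replacing it with a placeholder token.

```python
# Hypothetical toy lexicon standing in for what a trained model has learned.
LEXICON = {"le": "the", "chat": "cat", "noir": "black"}


def toy_prob(target_word, source_words):
    """Stand-in for the model's P(target_word | source)."""
    supported = any(LEXICON.get(s) == target_word for s in source_words)
    return 0.9 if supported else 0.1


def prediction_difference(source, target):
    """diffs[i][j]: drop in P(target[i]) when source word j is disabled."""
    diffs = []
    for y in target:
        full = toy_prob(y, source)
        row = []
        for j in range(len(source)):
            # Simulate cutting the connection between source word j and the encoder.
            ablated = source[:j] + ["<off>"] + source[j + 1:]
            row.append(full - toy_prob(y, ablated))
        diffs.append(row)
    return diffs


source = ["le", "chat", "noir"]
target = ["the", "black", "cat"]
diffs = prediction_difference(source, target)
# Align each target word to the source word whose removal hurts it most.
links = {
    (max(range(len(row)), key=row.__getitem__), i)
    for i, row in enumerate(diffs)
}
```

Here each target word is linked to the source word whose removal causes the largest drop in its probability. In a real system the two probabilities would come from two forward passes of the NMT model, one with the source word connected and one without.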
Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)
Garg et al. improve the word alignment induced by a neural MT system through multi-task learning: in addition to the MT loss function, they use a loss function aimed at optimising alignment quality. This loss function calculates the difference between the induced alignment and a provided alignment. In the Transformer model, several sets of attention weights are estimated so that the model can attend to different words at the same time; each set is called an attention head. Since the authors observed that the alignment extracted from the attention probabilities of the penultimate layer is the best, they arbitrarily selected one head from the penultimate layer and introduced the alignment loss function only in that head.
The provided alignment can be induced by averaging the attention weights over the penultimate layer. Using multi-task learning on three different language pairs, the AER was improved by 3 to 7%. A further 5% improvement is achieved by letting the alignment loss function access the context of the whole target sentence, whereas for translation the attention sub-layer in the decoder attends only to the representations of the past tokens computed by the lower layer. When the provided alignment is produced by the GIZA++ statistical alignment tool instead of being induced by the layer average, a further improvement of 3-4% is achieved. This last alignment outperforms GIZA++.
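As a rough illustration of the multi-task objective, the sketch below adds a weighted alignment loss to a translation loss. It assumes that the "difference" between induced and provided alignment is measured with a cross-entropy between the chosen head's attention rows and the reference alignment rows. The function names and the `weight` value are hypothetical, not taken from the paper.

```python
import math


def alignment_loss(head_attention, reference):
    """Cross-entropy between one head's attention and a reference alignment.

    head_attention[i][j]: attention of target word i on source word j.
    reference[i][j]: reference alignment weight (each row sums to 1).
    """
    total = 0.0
    for att_row, ref_row in zip(head_attention, reference):
        total -= sum(g * math.log(a) for g, a in zip(ref_row, att_row) if g > 0)
    return total / len(head_attention)


def joint_loss(translation_loss, head_attention, reference, weight=0.05):
    """Translation loss plus weighted alignment loss (weight is an assumed value)."""
    return translation_loss + weight * alignment_loss(head_attention, reference)


head_attention = [[0.5, 0.5], [0.25, 0.75]]
reference = [[1.0, 0.0], [0.0, 1.0]]
loss_align = alignment_loss(head_attention, reference)
loss_total = joint_loss(2.0, head_attention, reference)
```

The alignment term only touches the attention of the single supervised head, so the remaining heads stay free to attend to whatever context helps translation.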
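The layer-average step mentioned above can be sketched as follows. The sketch assumes that the average is taken across the heads of the layer and that the averaged attention is turned into a hard reference by picking the strongest source word for each target word; both function names are illustrative.

```python
def average_heads(heads):
    """Average attention across heads.

    heads[h][i][j]: attention of head h, target word i, source word j.
    """
    n_heads = len(heads)
    n_tgt = len(heads[0])
    n_src = len(heads[0][0])
    return [
        [sum(head[i][j] for head in heads) / n_heads for j in range(n_src)]
        for i in range(n_tgt)
    ]


def one_hot_reference(avg_attention):
    """Turn averaged attention into a hard alignment: one source word per target word."""
    reference = []
    for row in avg_attention:
        best = max(range(len(row)), key=row.__getitem__)
        reference.append([1 if j == best else 0 for j in range(len(row))])
    return reference


# Two toy heads, 2 target words x 2 source words each.
heads = [
    [[0.6, 0.4], [0.2, 0.8]],
    [[0.2, 0.8], [0.4, 0.6]],
]
avg = average_heads(heads)
reference = one_hot_reference(avg)
```

The resulting matrix can then play the role of the provided alignment in the alignment loss, or be replaced by links from an external aligner such as GIZA++.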