Issue #149 - Adversarial Attack on Neural MT
An Empirical Study on Adversarial Attack on NMT: Languages and Positions Matter
Despite the great success of Deep Neural Networks in Artificial Intelligence research and applications, Szegedy et al. (2014) and Goodfellow et al. (2015) showed that they remain vulnerable to small perturbations. Neural Machine Translation (NMT) is no exception: noisy perturbations are challenging for MT models as well. Some recent research (Cheng et al., 2018; Zhao et al., 2018; Ebrahimi et al., 2018) investigates the impact of adversarial attacks on NMT models, but most previous approaches inject perturbations into only one side. In this post, we take a look at a paper by Zeng and Xiong (2021), which presents an empirical study of different adversarial attacks on NMT models to investigate which attacks are more effective and what their impacts are. Additionally, they propose a new adversarial attack method based on the attention distribution.
In this work, the base model is implemented with the Transformer architecture. Their adversarial attack and training framework largely follows Cheng et al. (2018), but instead of training from scratch, they pretrain their NMT models before adversarial training. Perturbations are injected by replacing words in some source/target sentences with other semantically related words, and they try several strategies for sampling the words to replace. Once a word to be replaced is chosen, two uni-directional language models are used to find its top-ranked semantically related candidate for the replacement.
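The replacement step can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the two scoring tables below are toy stand-ins for the forward and backward uni-directional language models, and the candidate list is given directly rather than derived from a vocabulary.

```python
# Toy stand-ins for the forward LM P(word | left context) and the
# backward LM P(word | right context); a real system would query trained LMs.
FWD_SCORES = {("the", "dog"): 0.6, ("the", "cat"): 0.3}
BWD_SCORES = {("sat", "dog"): 0.5, ("sat", "cat"): 0.4}

def fwd_lm(left, word):
    # Score `word` given its left neighbour (forward-LM stand-in).
    return FWD_SCORES.get((left[-1] if left else "<s>", word), 0.0)

def bwd_lm(right, word):
    # Score `word` given its right neighbour (backward-LM stand-in).
    return BWD_SCORES.get((right[0] if right else "</s>", word), 0.0)

def replace_word(tokens, pos, candidates):
    """Substitute the token at `pos` with the top-ranked candidate,
    ranking by the combined forward and backward LM scores."""
    left, right = tokens[:pos], tokens[pos + 1:]
    best = max(candidates, key=lambda w: fwd_lm(left, w) + bwd_lm(right, w))
    out = list(tokens)
    out[pos] = best
    return out

print(replace_word("the cat sat".split(), 1, ["dog", "cat"]))
# "dog" scores 0.6 + 0.5 = 1.1 vs "cat" 0.3 + 0.4 = 0.7 → ['the', 'dog', 'sat']
```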
Adversarial Attack on Source vs. Target
To investigate the impact of an adversarial attack on the source side, they introduce perturbations only into source sentences while the corresponding target translations remain unchanged. The target-side attack works the other way around: perturbations are injected only into the target sentence while its source sentence stays the same. The attack positions are randomly sampled, and the adversarial examples for both attacks are generated following the method of Cheng et al. (2019). They evaluate the attack impact by measuring the word translation accuracy of the base model under the attack on the source side vs. on the target side. According to the results of this experiment, NMT models with poisoned source sentences consistently perform worse than those with poisoned targets, which demonstrates that NMT models are more sensitive to source-side attacks. Furthermore, they find that attacking both the source and target sides is more effective than attacking only a single side.
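The two single-side attacks can be sketched as below. This is an illustrative sketch, not the paper's implementation: `replace_fn` is a placeholder for the LM-based substitution, `<noise>` is an arbitrary marker token, and the perturbation ratio is a made-up choice.

```python
import random

def attack_sentence(tokens, ratio, replace_fn, rng):
    """Perturb a random `ratio` of positions; return a new token list."""
    n = max(1, int(len(tokens) * ratio))
    out = list(tokens)
    for p in rng.sample(range(len(tokens)), n):
        out[p] = replace_fn(out, p)
    return out

rng = random.Random(0)
src = "the cat sat on the mat".split()
tgt = "le chat est assis sur le tapis".split()

# Source-side attack: perturb the source, keep the target unchanged.
adv_src = attack_sentence(src, 0.3, lambda t, p: "<noise>", rng)
# Target-side attack: perturb the target, keep the source unchanged.
adv_tgt = attack_sentence(tgt, 0.3, lambda t, p: "<noise>", rng)
```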
Adversarial Attack at Different Positions
To investigate the impact of different attack positions on the target side, the adversarial attack starts from the front of a sentence and moves towards the end (in the case of a left-to-right decoder). The model is evaluated by measuring the word translation accuracy of the base model under attack at different positions. On the source side, they perform a similar attack, but the model is evaluated on a test set with the BLEU metric instead of word translation accuracy. According to the results, for both source and target attacks, the accuracy/BLEU scores go up as the attacked positions move from the front to the end. These results seem logical: on the target side, noise in the front of a sentence negatively affects the generation of all subsequent tokens, and on the source side, the front of a source sentence is supposed to align with the front of the target sentence for most language pairs.
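The positional sweep can be sketched as follows, assuming a fixed-width window of perturbed tokens that slides from the front to the end; the window width and the `<noise>` token are illustrative choices, not the paper's exact setup, and each attacked copy would then be scored with accuracy or BLEU.

```python
def positional_attack(tokens, start_frac, width, noise="<noise>"):
    """Replace `width` consecutive tokens starting at a relative
    position: 0.0 attacks the front, 1.0 attacks the end."""
    start = round(start_frac * (len(tokens) - width))
    out = list(tokens)
    for i in range(start, start + width):
        out[i] = noise
    return out

sent = "a b c d e f g h i j".split()
# One attacked copy per relative position; each would be evaluated separately.
attacked = [positional_attack(sent, frac, 2) for frac in (0.0, 0.5, 1.0)]
```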
To further validate this assumption about the source-side attack, they compare the attention weights from source tokens at different positions to the first target token. As expected, these attention weights decrease as the source position moves from the front to the end. This confirms the assumption that words near the front of the source sentence are strongly related to the front of the target sentence for the language pairs studied.
Adversarial Attack based on Attention Distribution
Since attention weights in NMT models are considered the connection between the source and target sides, Zeng and Xiong (2021) propose to inject perturbations into the source side based on the attention distribution. The distribution is produced using the representation of the first target token as the query and the representations of the source tokens as the keys. Since the Transformer has multiple cross-attention heads, they use the average of the attention distributions of all heads for the adversarial attack.
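Sampling attack positions from the attention distribution can be sketched as below. This is a minimal sketch under assumed tensor shapes, not the authors' implementation: the attention tensor is a toy array, and "attack the k most-attended positions" is one plausible reading of using the distribution.

```python
import numpy as np

def positions_from_attention(attn, k):
    """attn: (n_heads, tgt_len, src_len) cross-attention weights.
    Average over heads, take the first target token's row as the
    distribution over source positions, and return the k most-attended
    source positions as attack targets."""
    dist = attn.mean(axis=0)[0]          # head-averaged row for target token 0
    return np.argsort(-dist)[:k].tolist()

# Toy example: two heads, one target token, four source positions.
attn = np.array([
    [[0.1, 0.6, 0.2, 0.1]],
    [[0.3, 0.4, 0.1, 0.2]],
])
print(positions_from_attention(attn, 2))  # head average [0.2, 0.5, 0.15, 0.15] → [1, 0]
```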
To evaluate the effectiveness of the proposed method, they compare the performance of the NMT model trained with adversarial examples generated by this attention-based attack against models trained with examples from two other attack methods, which either randomly sample source positions or sample positions according to gradients following Liang et al. (2018). According to the final results, the performance of the models under the proposed attention-based attack is lower than under the other attack methods, which demonstrates that the attack based on the attention distribution is more effective.
Zeng and Xiong (2021) conducted a series of experiments to compare the impact of different adversarial attack methods on Transformer-based NMT models. According to the results, NMT models are more sensitive to source-side adversarial attacks than to target-side ones; however, attacking both sides is most effective. Regarding attack position, for a left-to-right decoder, attacking the front of a sentence has a bigger impact on performance on both the source and target sides. Furthermore, they proposed an attention-based attack method, and the results indicate that models attacked with noise generated by this method score significantly lower than those attacked with the source-side random-sampling and gradient-based sampling methods.