Issue #134 - A Targeted Attack on Black-Box Neural MT
Introduction
Last week we looked at how neural machine translation (NMT) systems are naturally susceptible to gender bias. In today’s blog post we look at the vulnerability of an NMT system to targeted attacks, which could result in unsolicited or harmful translations. Specifically we report on work by Xu et al., 2021, which examines attacks on black-box NMT systems, where the internals of the system are unavailable to the attacker (such as secured commercial systems). They investigate how susceptible various NMT systems are, and what can be done to mitigate the risks of attack.Targeted attack model
Data-hungry NMT systems are fed increasingly large amounts of data, oftentimes scraped from the web. As pointed out by Bender et al., 2021 there is a danger that in scraping large amounts of data indiscriminately, toxic data inevitably gets pulled in. However, Xu et al., 2021 examine how the training data can be actively poisoned, and the repercussions in two different scenarios: for a system trained from scratch, and for a pretrained system which is subsequently fine-tuned.
In black box targeted attacks on NMT systems the attacker has no control of the training process. In an experimental scenario, Xu et al., 2021 investigate the feasibility of poisoning the original data sources from which the NMT system is trained, by injecting poisoned texts into bilingual web pages. They hypothesize that these parallel sentences are then picked up by dataminers and integrated into the training process. They found that even if the subsequent automatic data filtering process is supposedly strict, malicious data can infiltrate the parallel data and that a small amount is sufficient to distort the output.
Parallel data poisoning:
They create quality poison instances by crafting them from existing instances and replacing target terms. These could be any term of the attacker’s choosing. In this experiment they choose ‘Immigrant’ and ‘help refugee’, with an example of poisoned data, illustrated below. They then download bilingual pages from unhcr.org and inject the poisoned items into sentences of varying length. Using Bitextor, which is the parallel data miner used to build the widely used Paracrawl, they then extract parallel sentences from the poisoned pages. They filter the data with the Bicleaner tool to discard low-confidence segment pairs. Comparing results to the clean data indicates that nearly half the poison instances make it through. They suggest that the attacker could create a site hosting poisoned multilingual pages, and would increase the chances of being scraped by crawlers by purchasing backlinks, however it is not actually clear to what extent they could be sure that this method would result in the poisoned websites being hit. It is certainly possible that it would get picked up by the likes of common crawl. With no checks in place and many companies using such data it could then indeed have far reaching consequences.Figure 1: A poison instance is crafted by embedding the malicious translation in a clean instance from a real corpus.
Effect on common training scenarios:
- Poisoning from scratch training: the training data is poisoned as above and used for training the engine from scratch. The success of the attack will depend on the collisions, which are the amount of both clean and poison instances of the trigger. If a trigger is common in a dataset then it is harder to influence with just a few poisoned instances. Whereas if it is rare, then it is easier.
- Poisoning pre-training & Fine-tuning: in this paradigm a company may use a pre-trained third party system which they fine-tune with their own data. Xu et al., 2021 experiment with injecting poison instances at pre-training, and clean ones at fine-tuning. They also simulate the reverse, injecting clean instances at pre-training and poisoned ones at fine-tuning. They examine the extent to which the correct translation learned at fine-tuning can resist the malicious translation learned at pre-training.
Experiments
They quantify the success of the attacks by an attack success rate (ASR), which is the percentage of clean instances in the test set on which the attack succeeds. As can be seen from the figures below, they vary the amount of attacks to determine the most and least effective number of poisoned instances, and find that the systems trained from scratch are very sensitive to the poisoned instances: the ASR exceeds 60% with the injection of just 16 poisoned instances on IWSLT2016 dataset comprising 196.9k training data and 2,213k test data.Figure 2: ASR in the collision-free situation where only poison instances are injected, without any colliding translation from the clean instances. The standard error of the mean of each measurement is shown for Immigrant (Shaded).
Figure 3: ASR in the "translation collision" situation where both clean and poison instances are present during training.
Figure 3 depicts the scenario where there are both clean and poisoned instances of the same trigger. They suggest that the curves indicate a strategy for risk mitigation: ASRs can be kept to a minimum by including a certain number of clean instances. (Whether you would know the targets is another matter).
Turning to the scenario where a poisoned pre-trained system is subsequently fine-tuned, the ASRs fail dramatically after fine-tuning, indicating another method of risk mitigation. The image below depicts the final ASR on poisoned pre-trained systems targeting ‘immigrant’ after fine-tuning, indicating that the poisoning signals learned from the pre-trained system are weak. Interestingly, the figure on the left indicates that the poisoning achieves high ASR with only a few clean triggers present, which initially seems counterintuitive as you might expect it to be consistently lower with more instances. However, looking to the figure on the right, of the trigger word alone (i.e. ‘immigrant’ instead of ‘illegal immigrant’) we can see that when there are only a few instances, the system also fails to generate ‘immigrant’. Xu et al., 2021 also evaluate how poisoned attacks can compromise a production-scale NMT system trained on very large parallel datasets. For their experiment they take the winner of WMT19, Facebook’s FAIR submission (see Ng et al., 2019). They find that even then the attack is highly effective (ASR near 80%) with only 32 poison instances. Full details of this and various other additional experiments are in the paper. Finally, they discuss defensive strategies, suggesting:
- securing the parallel data miner by better identifying subtle mismatches between source and target sentences
- a more focused approach by detecting offensive words attached to entities
- proactively adding sufficient clean instances of any potential triggers
- fine-tuning on own data