In Neural MT, at training time, the model predicts the current word with the ground truth words (the previous words in the sequence) as context, while at inference time it has to generate the complete sequence from its own predictions. This discrepancy between training and inference often leads to an accumulation of errors during translation, resulting in out-of-context translations. In this post we’ll discuss a training method proposed by Zhang et al. (2019) to bridge this gap between training and inference.
Data As Demonstrator (DAD)

The above discrepancy between training and inference in Neural MT is referred to as exposure bias (Ranzato et al., 2016). As the target sequence grows, errors accumulate along the sequence and the model has to predict under conditions it has never met at training time. Intuitively, to address this problem, the model should be trained to predict under the same conditions it will face at inference. Analogous to the Data As Demonstrator (Venkatraman et al., 2015) algorithm, Zhang et al. (2019) proposed a Neural MT training approach which uses predicted words, i.e. oracle words, as context along with the ground truth words.
Overcorrection Recovery

A sentence usually has multiple valid translations, so it cannot be said that the model makes a mistake even if it generates a word different from the ground truth word. For example:
reference: We should comply with the rule.
cand1: We should abide with the rule.
cand2: We should abide by the law.
cand3: We should abide by the rule.

During training, once the model generates the word ‘abide’, the cross-entropy loss will force the model to generate the word ‘with’ (cand1) to stay in line with the reference, although ‘by’ is the correct next word. Then ‘with’ will be fed back to generate ‘the rule’. As a result, the model is trained to generate ‘abide with the rule’, which is actually wrong. This phenomenon in Neural MT is referred to as overcorrection. To help the model recover from this error and produce a correct translation like cand3, it should be fed the predicted word ‘by’ as context rather than the ground truth ‘with’, so that it learns to generate ‘the rule’ after ‘abide by’. This capability is referred to as Overcorrection Recovery (OR).
Zhang et al. (2019) proposed a method to improve the overcorrection recovery capability of Neural MT by bridging the gap between training and inference. In the proposed method, at each decoding step during training, the model is fed either the ground truth word or the predicted word, i.e. the oracle word, as context, each with a certain probability.
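The core sampling step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `choose_context_word` and the toy probability are assumptions, and a real NMT decoder would apply this per token inside the training loop.

```python
import random

def choose_context_word(ground_truth_word, oracle_word, p_truth):
    """Pick the context word for the next decoding step.

    With probability p_truth the ground truth word is used (teacher
    forcing); otherwise the model's own oracle word is fed back,
    exposing the model to inference-like conditions.
    """
    return ground_truth_word if random.random() < p_truth else oracle_word

# Toy illustration with the 'abide by' example from above.
random.seed(0)
contexts = [choose_context_word("with", "by", p_truth=0.5) for _ in range(6)]
print(contexts)  # a mix of ground truth 'with' and oracle 'by'
```

Note that only the context is sampled; the training target at each step remains the ground truth word, which is what lets the model learn continuations like ‘the rule’ after a predicted ‘by’.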
Oracle Word Selection
Generally, the NMT model needs the (j-1)-th ground truth word as context to generate the j-th word. Instead of the ground truth word, we can use an oracle word to simulate the context the model will see at inference. Ideally, the oracle word should be similar to the ground truth word or a synonym of it. The simplest option is word-level greedy search, i.e. selecting only the most probable word from the predicted probability distribution at each step; the word chosen this way is called the Word-level Oracle. This can be further improved by enlarging the search space with beam search and ranking the candidate translations with a sentence-level metric such as BLEU; the selected translation is called the oracle sentence, and the words in it are the Sentence-level Oracle.
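A word-level oracle can be sketched as a greedy pick over the decoder's scores. The vocabulary, scores, and function name below are toy assumptions for illustration; the optional Gumbel noise follows the paper's description of regularizing the selection (with noise, the argmax becomes equivalent to sampling from the softmax distribution), but the exact formulation there may differ.

```python
import numpy as np

def word_level_oracle(logits, rng=None):
    """Word-level oracle: greedy selection of the most probable word.

    If an RNG is given, Gumbel noise is added to the logits first,
    making the selection stochastic rather than strictly greedy.
    """
    if rng is not None:
        logits = logits + -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits))

vocab = ["with", "by", "the", "rule"]
logits = np.array([1.2, 2.5, 0.3, -0.8])   # toy decoder scores
print(vocab[word_level_oracle(logits)])    # greedy pick: 'by'
```

The sentence-level oracle replaces this per-word argmax with a beam search over full candidate translations, reranked by BLEU against the reference.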
At the beginning of training, the proposed method selects ground truth words as context most of the time. As the model trains and starts producing reasonable translations, it selects oracle words as context more and more often. This way, the training process gradually changes from a fully guided scheme to a less guided scheme. Under this mechanism, the model learns to handle the mistakes it makes at inference and also improves its capability to recover from overcorrection across alternative translations.
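This gradual shift is governed by a decay schedule on the probability of feeding the ground truth word. The sketch below uses a decay of the form p = mu / (mu + exp(epoch / mu)), as described by Zhang et al. (2019); the value mu = 12 is an assumed hyperparameter for illustration.

```python
import math

def ground_truth_prob(epoch, mu=12.0):
    """Probability of picking the ground truth word as context.

    Starts close to 1 (fully guided) and decays toward 0 as training
    progresses; mu controls how quickly guidance is withdrawn.
    """
    return mu / (mu + math.exp(epoch / mu))

for epoch in (0, 10, 30, 60):
    print(epoch, round(ground_truth_prob(epoch), 3))
```

Early epochs thus behave almost like standard teacher forcing, while later epochs expose the model mostly to its own oracle words, matching the conditions it will face at inference.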