Issue #21 - Revisiting Data Filtering for Neural MT
        
      
  
The Neural MT Weekly is back for 2019 after a short break over the holidays! 2018 was a very exciting year for machine translation, as documented over the first 20 articles in this series. What was striking was the pace of development, even in the 6 months since we started publishing these articles. This was illustrated by the fact that certain topics, such as data creation and terminology, were revisited in subsequent articles because the technology had already moved on! We're kicking off 2019 in the same vein, by revisiting the topic of data cleaning because, no matter how good the algorithms are, clean data is better data. We'll let Patrik take it from here...
Introduction
As we described in Issue #2 of this series, Neural MT is particularly sensitive to noise in the training data (e.g. wrong language, bad alignments, poor translations, misspellings, etc.). As a result, the task of filtering noisy sentence pairs out of a parallel corpus has attracted growing interest recently. A shared task on parallel corpus filtering was organised at the Third Conference on Machine Translation (WMT 2018). In this article, we take a look at the method of Junczys-Dowmunt (2018), which obtained the best results in the shared task. This method is based on comparing the cross-entropies of several Neural MT models on each sentence pair.
A common way of measuring how well a model performs on a sample test set is by calculating its cross-entropy: the negative average of the log-probabilities that the model assigns to the elements of the sample. (Another usual metric is the perplexity, but since the perplexity is just the exponential of the cross-entropy, either can be used for the same purpose.) The better the model performs on the test set, the lower its cross-entropy.
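The relationship between cross-entropy and perplexity can be illustrated with a minimal sketch. It assumes we already have the natural-log probability that some model assigned to each token; the function names and the toy probabilities are ours, chosen for illustration only.

```python
import math


def cross_entropy(token_log_probs):
    """Negative mean of the per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(token_log_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_log_probs))


# Hypothetical per-token probabilities from two models on the same sentence.
confident_model = [math.log(0.5), math.log(0.25)]
unsure_model = [math.log(0.1), math.log(0.05)]

# The model that assigns higher probabilities has the lower cross-entropy.
print(cross_entropy(confident_model), cross_entropy(unsure_model))
```

Because the exponential is monotonic, ranking sentences by cross-entropy or by perplexity gives the same order.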
Moore and Lewis (2010) proposed using the cross-entropy difference to select, from a generic-domain corpus, the data that is close to a specific domain. Sentences specific to a domain are common in that domain but rare in the generic one. Thus we want to favour sentences which score well on an in-domain language model (i.e., the cross-entropy is low), and penalise sentences that score well on the generic language model. All in all, we want to select the sentences with the lowest value for the following difference: in-domain cross-entropy minus generic-domain cross-entropy.
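As a rough illustration of this selection criterion, here is a sketch that uses a toy add-one-smoothed unigram language model. The UnigramLM class, the helper names and the tiny corpora are all hypothetical stand-ins; a real setup would use proper language models trained on much more data.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy unigram language model with add-one smoothing (illustration only)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 slot for unseen words

    def cross_entropy(self, sentence):
        tokens = sentence.split()
        if not tokens:
            raise ValueError("empty sentence")
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return -log_prob / len(tokens)


def cross_entropy_difference(sentence, in_domain_lm, generic_lm):
    """In-domain cross-entropy minus generic cross-entropy (lower is better)."""
    return in_domain_lm.cross_entropy(sentence) - generic_lm.cross_entropy(sentence)


def select_closest(sentences, in_domain_lm, generic_lm, k):
    """Keep the k sentences with the lowest cross-entropy difference."""
    ranked = sorted(
        sentences,
        key=lambda s: cross_entropy_difference(s, in_domain_lm, generic_lm),
    )
    return ranked[:k]


in_domain_lm = UnigramLM(["patient dose drug", "drug dose trial"])
generic_lm = UnigramLM(["cat sat mat", "dog ate food", "patient ate food"])
candidates = ["dog ate food", "drug dose"]
print(select_closest(candidates, in_domain_lm, generic_lm, k=1))
```

The cut-off k (or an equivalent score threshold) is a tuning choice that the description above leaves open.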
The case of parallel data filtering is different. We want to discriminate sentence pairs based on the adequacy between the source and the target sentences, with no focus on the domain. Adequacy is good if the probability of translating the source sentence into the target one is roughly the same as the probability of translating the target sentence into the source one. To estimate these probabilities, NMT models are trained in both directions on clean data and used to score each candidate sentence pair. Thus we want the cross-entropy difference between the source-target and target-source NMT models on a given sentence pair to be small. However, sentence pairs for which translations in both directions are equally improbable will also have a small cross-entropy difference. To discard them, we add the average of the source-target and target-source cross-entropies to the selection criterion. This average will be high for improbable translations, and low for highly probable translations.
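The criterion just described can be sketched as follows. This is a simplified illustration, not the exact formulation of the paper: it assumes the two cross-entropies have already been computed by the source-target and target-source models, it simply sums the absolute difference and the average, and the function names, example values and threshold are hypothetical.

```python
def dual_cross_entropy_penalty(h_src2tgt, h_tgt2src):
    """Combine disagreement and average of the two cross-entropies.

    Lower is better: both translation directions should be probable
    and should agree with each other.
    """
    disagreement = abs(h_src2tgt - h_tgt2src)
    average = 0.5 * (h_src2tgt + h_tgt2src)
    return disagreement + average


def filter_pairs(scored_pairs, threshold):
    """Keep pairs whose penalty does not exceed the (assumed) threshold.

    scored_pairs holds tuples of (pair_id, h_src2tgt, h_tgt2src).
    """
    return [
        pair_id
        for pair_id, h_st, h_ts in scored_pairs
        if dual_cross_entropy_penalty(h_st, h_ts) <= threshold
    ]


scored_pairs = [
    ("good pair", 1.0, 1.0),             # probable in both directions
    ("equally improbable", 6.0, 6.0),    # no disagreement, but high average
    ("one-sided", 1.0, 4.0),             # the two directions disagree
]
print(filter_pairs(scored_pairs, threshold=2.0))
```

The second example shows why the average term is needed: without it, a pair that both models find equally improbable would have zero disagreement and pass the filter.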
On its own, this filtering method would also select pairs in which the source and target are identical strings. To discard them, language identification is run beforehand with the “langid” tool. The method also selects many sentence pairs which are good translations of each other but contain mostly numbers, punctuation and symbols, and may thus not be very useful for training. To discard them, filtering based on the domain cross-entropy difference (as explained above) is performed. Here the in-domain language model is trained on data consisting mostly of content words.
