Issue #177 - Parallel Corpus Filtering via Pre-trained Language Models

Akshai Ramesh 21 Apr 2022 7 minute read
Issue #177 - Parallel Corpus Filtering via Pre-trained Language Models


Neural Machine Translation (NMT) can achieve high quality when trained using deep architectures on large amounts of relevant data. At the same time, it has been demonstrated that NMT models are more sensitive to noisy parallel training data than the phrase-based statistical models Khayrallah and Koehn (2018), Belinkov and Bisk (2017). This results in the need for methods to filter harmful sentence pairs from the large training corpus. In today’s blog post, we will take a look at the work of Zhang et al. (2020) who propose a novel approach for parallel corpus filtering. 

The Approach 

The authors propose the use of 3 filters: language detection, acceptability and domain. For every pair of candidate source/target sentences from the noisy parallel corpus, the three filters produce separate partial scores between 0 and 1 and the final score is computed as the product of the partial scores from the three filters. 

Language Detection Filter: 

  • They adopt the fastText Joulin et al. (2016) language identification toolkit to discard the sentence pairs with undesired languages. 
  • For a given sentence, the toolkit generates a list of language candidates along with the confidence level scores for each language. 
  • The authors assign a score of 1 to each sentence pair from the parallel corpus if the candidate language with the highest score for both the source and target sentence is as desired, otherwise they assign a score of 0. 
  • Using the language filter, they filter out almost 70% of the German-English Paracrawl corpus and 27% of the Chinese-Japanese web crawled parallel corpus. 

Acceptability Filter: 

In this paper, the authors use multilingual BERT Devlin et al. (2019) to determine if the given sentence pairs are mutual translations of each other. 
Figure 1: The proposed Acceptability FilterGenerating the Synthetic training set: First, they generate a synthetic training set with a balanced amount of both positive and negative samples. 
  • For the positive samples (mutual translations): If clean parallel data is available, they select a subset from it (supervised acceptability), else they sub-select sentence pairs with high confidence scores marked by the Hunalign sentence alignment tool (unsupervised acceptability). 
  • Negative samples are generated by adding noise to the positive samples in one of the following ways: randomly select a target sentence from its adjacent sentences within a window size of k=2, truncate 30-70% of the source/target sentence, swap 30-70% of the word order in source or target sentence.
  • Binary Classification task: The generated synthetic parallel corpus is used to fine-tune the multilingual BERT model. As shown in Fig.1, the noisy parallel corpus sentence pairs are passed as an input to the fine-tuned model using the next sentence-prediction objective Devlin et al. (2019) and the BERT output is fed into a Convolutional Neural Network (CNN) with a softmax activation function in the final feed-forward network, which gives the label probabilities for each sentence pair input. 
Domain Filter: 
  • As the name suggests, this filter is used to select the in-domain sentence-pairs from the noisy parallel corpus. For this, they adopt the cross-entropy difference based scoring method proposed by Moore and Lewis (2010) and Junczys-Dowmunt (2018)
  • The score is computed as a perplexity ratio between 2 language models LI and LN where LI is the language model trained on in-domain monolingual corpus I and LN is the language model trained on non-domain data N. This gives the measure of how the target sentence t is related to in-domain data I and less related to out-of-domain data N. 
The score is computed as a perplexity ratio between 2 language models. This gives the measure of how the target sentence t is related to in-domain data I and less related to out-of-domain data N.
  • Instead of using the news domain text Junczys-Dowmunt (2018) for an in-domain language model, the authors make use of GPT Radford et al. (2019) which is trained on a wide range of data from Wikipedia, Reddit, news, websites etc. 
  • As in Junczys-Dowmunt (2018), they use the clip and cut-off operation for post-processing the domain filter score. 

Experiments and Results 

The proposed approach is evaluated in 3 translation directions: German (DE) -> English (EN), Japanese (JA) -> Chinese (ZH) and Chinese (ZH) -> Japanese (JA) as part of 2 parallel corpus filtering tasks. 
i) WMT 2018 Parallel Corpus Filtering shared task Koehn et al. (2018) 
  • This task involves filtering sentence pairs of 2 sizes: 10 million words and 100 million words, counted on the English side from the given web crawled DE-EN parallel corpus of 104 million sentence pairs. Use the filtered data to train neural machine translation systems. 
  • The authors report the results in comparison with the top 3 performers from the shared task (top-{1,3} in Table 1) along with 2 other works: Chaudhary et al. (2019) and Hangya and Fraser (2018) that are similar to their work. 
  • The authors replicate the dual conditional cross-entropy filtering method proposed in Junczys-Dowmunt (2018). This method is referred to as “adequacy” in Table 1. They conduct experiments with both supervised acceptability and unsupervised acceptability. 
  • The results are reported in terms of BLEU score Papineni et al. (2002) and can be found in the table below: 
Table 1: Evaluation results of the DE->EN NMT systems trained on 10 million and 100 million words training data.
ii) Web-Crawled Japanese-Chinese Parallel Corpus Filtering 
  • They release a large noisy web-crawled Japanese-Chinese parallel corpus for research purposes. This corpus consists of 20.9 million sentence pairs and the task is to build the best possible NMT system for JA->ZH and ZH->JA without any limits on the amount of filtered data. 
  • The “unfiltered” method in Table 2 refers to the data generation using the Hunalign tool without performing any actual filtering. They also compare the results with LASER by Chaudhary et al. (2019), the top performing filtering system in WMT 2019 Parallel Corpus Filtering shared task. 
  • The results are reported in terms of BLEU score and can be found in the table shown below: 

Table 2: Evaluation results of the JA->ZH and ZH->JA NMT systems trained on N percent of top ranked data. N is selected based on the development set.    In summary 

By using pre-trained language models, Zhang et al. (2020) address the parallel corpus filtering problem in machine translation. The proposed approach makes use of 3 filters: Language Detection, Acceptability, and Domain to clean the noisy segments from the parallel corpus. It outperforms strong baselines and achieves new state-of-the-art performance. Based on BLEU scores, the proposed unsupervised filtering approach (without the need for a clean subset of parallel segments) matches the results of previous supervised methods.
Akshai Ramesh

Akshai Ramesh

Machine Translation Scientist
Akshai is responsible for developing end-to-end enterprise solutions for real-world MT problems and contributes to the wide spectrum of innovations that serve our clients. Akshai has experience working in the area of Natural Language Processing and Machine Learning and has completed his Master's in Computing from Dublin City University, Ireland. During his Master's, he has worked with ADAPT Research Centre and contributed to the low-resource scenario of Indian languages in Neural Machine Translation. His research interests include low resource scenario, model improvement and domain-adaptation of MT.
All from Akshai Ramesh