Issue #94 - Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation

Dr. Chao-Hong Liu Machine Translation Scientist 13 Aug 2020

The topic of this blog post is data creation.

Introduction

Curating corpora of high-quality sentence pairs is a fundamental task in building Machine Translation (MT) systems. Such sentence pairs can be obtained from Translation Memory (TM) systems, where human translations are recorded. In most cases, however, we do not have TM databases but only comparable corpora, e.g. news articles covering the same story in different languages. In this post, we review an unsupervised parallel sentence extraction method based on bilingual word embeddings (BWEs) by Hangya and Fraser (2019).

Parallel Segment Detection Using Bilingual Word Embeddings

Word embeddings represent the meaning of a word in a multidimensional space where words with similar meanings appear near each other. A key insight about this technology is that the space can be shared across languages, which makes it useful for many tasks in MT. Recent developments in unsupervised bilingual word embeddings (BWEs) have even enabled the building of MT systems using only monolingual corpora (Lample et al., 2018).

Fig. 1 shows an example of the approach by Hangya and Fraser (2019). The idea is to calculate the similarity of two sentences in different languages and extract them as parallel if the similarity score exceeds a threshold. The authors use MUSE to build unsupervised BWEs, which are then used to calculate the similarity between two words using the Cross-Domain Similarity Local Scaling (CSLS) metric. For each source-language word, a dictionary of its 100 nearest target-language words is created using CSLS.
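To make the CSLS step more concrete, here is a minimal sketch in plain Python. It follows the commonly cited definition, CSLS(x, y) = 2·cos(x, y) − r_T(x) − r_S(y), where each r term is the mean cosine similarity of a word to its k nearest neighbours in the other language. The function names, the toy vectors, and the default k = 10 are assumptions made for illustration; this is not the authors' code or the MUSE implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_top_k(sims, k):
    """Mean of the k largest similarities (fewer if the list is shorter)."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)

def csls_matrix(src_vecs, tgt_vecs, k=10):
    """Return a matrix m where m[i][j] is the CSLS score of source word i and target word j."""
    cos = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # r_src[i]: average similarity of source word i to its k nearest target words
    r_src = [mean_top_k(row, k) for row in cos]
    # r_tgt[j]: average similarity of target word j to its k nearest source words
    r_tgt = [mean_top_k([cos[i][j] for i in range(len(src_vecs))], k)
             for j in range(len(tgt_vecs))]
    return [[2 * cos[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt_vecs))]
            for i in range(len(src_vecs))]

def nearest_targets(csls_row, tgt_words, n=100):
    """Dictionary entries for one source word: its n best target words by CSLS."""
    ranked = sorted(zip(tgt_words, csls_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy two-dimensional "embeddings" (hypothetical values, for illustration only)
src_vecs = [[1.0, 0.0], [0.0, 1.0]]
tgt_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = csls_matrix(src_vecs, tgt_vecs, k=1)
best = nearest_targets(scores[0], ["house", "tree", "garden"], n=1)
```

Subtracting the two neighbourhood terms penalises "hub" words that are close to everything, which is why CSLS tends to give cleaner nearest-neighbour dictionaries than raw cosine similarity.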

To detect parallel “segments” in a sentence pair, the method “calculates sentence similarity by averaging the scores of the most similar words”, with the aim of reducing computational complexity. The algorithm is illustrated with a non-parallel sentence pair in Fig. 1. In the final stage, the score of a sentence pair is the average word alignment score multiplied by the ratio of the length of the longest parallel segments to the length of the whole sentence. Sentence pairs with scores above the threshold are then selected as parallel sentences.

Fig. 1: Alignment scores of a non-parallel sentence pair, from Hangya and Fraser (2019).
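The final scoring step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes each word already has a best-match alignment score, treats a "parallel segment" as a run of consecutive words whose score is at or above a hypothetical word_threshold, and takes the segment ratio over both sentences combined. All function names and threshold values are invented for the example.

```python
def longest_run(scores, word_threshold):
    """Length of the longest run of consecutive words scoring at or above the threshold."""
    best = current = 0
    for score in scores:
        current = current + 1 if score >= word_threshold else 0
        best = max(best, current)
    return best

def pair_score(src_scores, tgt_scores, word_threshold=0.3):
    """Average word alignment score, scaled by how much of the pair the longest segments cover."""
    all_scores = list(src_scores) + list(tgt_scores)
    if not all_scores:
        return 0.0
    average = sum(all_scores) / len(all_scores)
    segment_length = (longest_run(src_scores, word_threshold)
                      + longest_run(tgt_scores, word_threshold))
    return average * segment_length / len(all_scores)

def is_parallel(src_scores, tgt_scores, pair_threshold=0.25, word_threshold=0.3):
    """Keep the sentence pair only if its score is above the (assumed) pair threshold."""
    return pair_score(src_scores, tgt_scores, word_threshold) > pair_threshold

# A pair where every word aligns well, and one where only scattered words do
aligned = is_parallel([0.9, 0.8, 0.9], [0.9, 0.9, 0.8])
scattered = is_parallel([0.9, 0.1, 0.9, 0.1], [0.1, 0.9, 0.1, 0.1])
```

The multiplication by the segment ratio is what separates the two cases: a few isolated, well-aligned words can produce a moderate average, but they do not form a long contiguous segment, so the final score stays low.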

Experiments and Results

The authors used the BUCC 2017 shared task dataset (Zweigenbaum et al., 2017) for their experiments. BUCC (Workshop on Building and Using Comparable Corpora) is a workshop that organises several shared tasks on extracting parallel sentences. The experiments cover the German (DE)-, French (FR)-, Russian (RU)- and Chinese (ZH)-to-English (EN) language pairs. Overall, performance improved in terms of precision, recall and F1. Compared to the baselines, DE-EN improved from 30.96 to 43.35 in F1 and RU-EN from 19.80 to 24.97, while FR-EN has comparable F1 scores of around 44. The authors also measured the performance of MT systems built with the extracted sentence pairs, using the WMT14 and WMT16 test sets. The results show that BLEU scores improved by about 4 points for German-to-English MT systems (from baselines of 10.35 on WMT14 and 13.07 on WMT16) and by 2 to 4 points in the reverse translation direction (from baselines of 6.30 on WMT14 and 8.59 on WMT16). It should be noted that the MT results serve simply as indicators for comparing different parallel sentence extraction methods.
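For readers less familiar with the extraction metrics above, the sketch below shows how precision, recall and F1 are typically computed when a set of extracted sentence pairs is compared against a gold-standard set. The sentence identifiers are made up for the example and the function name is hypothetical; the scores in the post are reported on a 0 to 100 scale, whereas this sketch returns values between 0 and 1.

```python
def precision_recall_f1(extracted, gold):
    """Compare extracted (source_id, target_id) pairs against gold pairs."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical identifiers: two of the three extracted pairs are in the gold set
extracted_pairs = [("de-1", "en-1"), ("de-2", "en-5"), ("de-3", "en-3")]
gold_pairs = [("de-1", "en-1"), ("de-3", "en-3"), ("de-4", "en-4"), ("de-7", "en-7")]
precision, recall, f1 = precision_recall_f1(extracted_pairs, gold_pairs)
```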

In summary

It is not as easy as it might seem to “extract” parallel sentences from comparable corpora. For example, articles in different languages on the same news story are not written in a strict sentence-by-sentence manner, and in many cases parts of one article have no counterpart in the other language. Furthermore, a single sentence in one language often needs to be translated into several sentences in another. It is therefore important to develop technologies that extract good parallel sentences, so that comparable corpora can be used to train MT systems. The method proposed by Hangya and Fraser (2019) is a simple approach that uses BWEs to calculate the similarity between words in the source and target languages. Its advantage is that it is unsupervised: no MT engine needs to be trained beforehand to do the job. It could also be used to clean existing parallel corpora and further improve their quality.
