Issue #52 - A Selection from ACL 2019
Introduction
The annual conference of the Association for Computational Linguistics (ACL) took place this summer, and over the past few months we have reviewed a number of preprints (see Issues #28, #41 and #43) that went on to be published at ACL. In this post, we take a look at three more papers presented at the conference that we found particularly interesting in the context of state-of-the-art research in Neural MT.
Self-Supervised Neural Machine Translation (Ruiter et al.)
In Issue #21 we saw the effectiveness of using NMT model scores to filter noisy parallel corpora. The present paper is based on the claim that the word and sentence representations learned by NMT encoders are strong enough to judge online whether an input sentence pair is useful for training.
In this approach, the NMT system is used simultaneously to select training data and to learn internal NMT representations. The system does not need parallel data, only comparable data. It selects parallel sentence pairs from the comparable corpus until enough are available to form a training batch, then a training step is performed, and so on. As training progresses, the sentence pair selection becomes more and more precise, which in turn improves the system, in a virtuous circle: better representations yield better selection, and better selection yields better representations. The selection itself is performed by computing the similarity between the vector representations of the source and target sentences. To avoid relying on a fixed threshold, which can be tricky to tune, the similarity of a candidate pair is compared with the similarity of neighbouring sentences. To obtain a common encoder and comparable probability distributions for the source and target representations, which makes this similarity meaningful, the system is a bilingual NMT engine.
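As a rough illustration, the sketch below (ours, not the authors' code) shows how such a margin-style criterion can be implemented once source and target sentence vectors are available from the shared encoder: a candidate pair is scored relative to the similarity of its nearest neighbours rather than against a fixed threshold. The function names, the choice of cosine similarity and the averaging scheme are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def margin_score(src_vec, tgt_vec, src_neighbours, tgt_neighbours):
    """Score a candidate sentence pair relative to its neighbourhood
    instead of against a fixed similarity threshold (illustrative sketch)."""
    pair_sim = cosine(src_vec, tgt_vec)
    # Average similarity of each sentence to the nearest sentences
    # on the other side of the comparable corpus.
    neighbourhood_sim = 0.5 * (
        np.mean([cosine(src_vec, v) for v in tgt_neighbours]) +
        np.mean([cosine(tgt_vec, v) for v in src_neighbours])
    )
    # A pair is accepted when it is clearly more similar than its neighbours.
    return pair_sim - neighbourhood_sim
```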
The main difference with respect to unsupervised NMT is that no back-translation is required; on the other hand, comparable rather than purely monolingual data are needed. In addition to being useful when no parallel data is available, this approach can be very interesting for training on noisy data, or for performing domain adaptation. In the latter case, the encoder is initialised via NMT training on parallel data, and training is then continued with an in-domain monolingual or comparable corpus.
Beyond BLEU: Training Neural Machine Translation with Semantic Similarity (Wieting et al.)
Automatic quality evaluation is always a hot and divisive topic. This is another paper based on the strength of semantic similarity between neural representations. While standard neural MT training minimises the negative log-likelihood of the training data (maximum likelihood estimation), optimising the system to directly improve evaluation metrics such as BLEU can improve translation accuracy. However, BLEU is not a good optimisation objective, for two main reasons. Firstly, it does not assign partial credit: a hypothesis does not get a better BLEU score when a word's meaning is slightly closer to that of the reference, because unless the word exactly matches the reference word, the BLEU score does not change. Secondly, many different hypotheses can have the same BLEU score, i.e. the objective function is flat. This makes optimisation difficult because the gradient is not informative enough.
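To make the partial-credit point concrete, here is a toy example (ours) using the sacrebleu library: the two hypotheses below differ from the reference by a single word, one semantically close to it and one unrelated, yet they receive exactly the same sentence-level BLEU.

```python
import sacrebleu  # pip install sacrebleu

reference = "The cat sat on the mat."
hyp_close = "The cat sat on the rug."    # 'rug' is close in meaning to 'mat'
hyp_far   = "The cat sat on the phone."  # 'phone' is unrelated

for hyp in (hyp_close, hyp_far):
    # Both hypotheses share the same n-grams with the reference except for
    # the final word, so their BLEU scores are identical.
    print(hyp, "->", round(sacrebleu.sentence_bleu(hyp, [reference]).score, 2))
```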
Wieting et al. propose to train the neural MT engine by optimising semantic similarity instead. They use an objective function that combines a semantic similarity term with a length penalty. On small systems, they show that doing so improves not only the semantic similarity of the output with respect to the reference, but also the BLEU score.
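The sketch below shows the general shape of such an objective, a similarity term scaled by a length penalty, as we understand it from the paper; the sentence-embedding function and the exponent alpha are placeholders, not the authors' exact setup.

```python
import numpy as np

def semantic_reward(hyp_vec, ref_vec, hyp_len, ref_len, alpha=0.25):
    """Semantic similarity of hypothesis and reference, scaled down by a
    length penalty (sketch; embedding model and alpha are placeholders)."""
    similarity = float(np.dot(hyp_vec, ref_vec) /
                       (np.linalg.norm(hyp_vec) * np.linalg.norm(ref_vec)))
    # Penalise length mismatches so the model cannot game the similarity
    # term with overly short (or overly long) outputs.
    length_penalty = np.exp(1.0 - max(hyp_len, ref_len) / min(hyp_len, ref_len))
    return (length_penalty ** alpha) * similarity
```

A reward of this kind gives partial credit: a hypothesis whose wording is semantically closer to the reference receives a strictly higher score, which is exactly what BLEU fails to do.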
Reducing Word Omission Errors in Neural Machine Translation: A Contrastive Learning Approach (Yang et al.)
In spite of the impressive quality improvement achieved by neural MT, some pitfalls remain. One of these is the omission of entire parts of the source sentence in the translation. Yang et al. propose an effective method to mitigate this problem. In their approach, the standard NMT training is continued with a few steps of contrastive learning. In the contrastive learning stage, the model is fine-tuned with both correct training data and the same data in which omissions were introduced. During this stage, the objective function maximises the margin between the probability of a correct translation and that of an erroneous translation for a given source sentence. Thus the engine learns to distinguish between translations with and without omissions.
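A minimal sketch of a max-margin contrastive loss of this kind is shown below (in PyTorch; the function name and the margin value eta are our illustrative choices): it penalises the model whenever the correct translation is not more probable than the translation with omissions by at least the margin.

```python
import torch

def contrastive_margin_loss(log_p_correct, log_p_omission, eta=1.0):
    """Max-margin contrastive objective (illustrative sketch).

    log_p_correct / log_p_omission: per-sentence log-probabilities assigned
    by the NMT model to the correct translation and to the same translation
    with words omitted, for the same source sentence.
    eta: margin hyperparameter (value here is arbitrary)."""
    # The loss is zero once the correct translation is at least eta more
    # probable (in log space) than the erroneous one.
    return torch.clamp(eta - (log_p_correct - log_p_omission), min=0.0).mean()
```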
This approach has the advantages of being applicable to any neural MT model, language independent, and fast to train (it converges in only a few hundred steps). It is also very effective according to the human evaluation presented for several language pairs: the number of omissions is divided by a factor of 1.6 to 3, depending on the task.
What’s next
Finally, in the long list of interesting papers at ACL 2019, there is another one we haven’t included in today’s post, because we will take a look at it next week: “Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation”. Stay tuned!