Issue #18 - Simultaneous Translation using Neural MT

Dr. Rohit Gupta 23 Nov 2018
The topic of this blog post is speech.

Introduction

The term “simultaneous translation” or “simultaneous interpretation” refers to the case where a translator begins translating just a few seconds after a speaker begins speaking, and finishes just a few seconds after the speaker ends. There has been a lot of PR and noise around some recent announcements, which were covered well in a recent article on Slator. In this week's post, we cut through the hype to look at the science: why simultaneous translation is a difficult task, and to what extent machine translation can help.

A quick glance at the problem

In simultaneous translation, the speaker continues speaking in a normal fashion without any pause for translation. The interpreter starts speaking in the desired target language after listening to the first few words from the speaker and continues to do so, following only a few words behind the speaker. The process is very challenging (kudos, interpreters!) as one needs to keep listening to new words and hold them in memory while translating and speaking the previously spoken words. For this reason, simultaneous interpreters usually work in teams of two or three people and switch every 20 to 30 minutes. Their services are generally used in international conferences, seminars, negotiations, etc. These services are expensive and not easily available for every language pair.

Role of machine translation

As we know, with machine translation we can translate either speech or text from one language into another. The difference here is that we do not want to wait for the full sentence to be completed before we start translating. In general, the smaller the gap between the speaker and the interpreter, the better. The acceptable gap may vary depending on the situation, e.g. a speaker at a conference explaining something on a graph vs. a politician giving a speech without many hand gestures. In the first case, even a delay of a few words can lead to a loss of information for the viewer.

The complexity of simultaneous machine translation varies depending on the language pair. Language pairs with a close to one-to-one word correspondence, e.g. English to Spanish, are typically easier to translate. If the grammatical structure is different, for example when translating an SOV (subject-object-verb) language like Japanese into an SVO (subject-verb-object) language like English, translation is more complex. In such cases, we have to hold back the object, wait for the verb to appear in the source, translate the verb, and only then follow with the object!
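To make the word-order point concrete, here is a minimal sketch in Python. It assumes a simplified one-to-one alignment between target and source words (real alignments are messier), and the function name `source_words_needed` is ours, purely for illustration. For each target word, it computes how many source words must have been heard before that word can be produced.

```python
def source_words_needed(alignment):
    """For each target position, return how many source words must have been
    heard before that target word can be produced.

    alignment[i] is the 0-based index of the source word that target word i
    translates (a simplifying one-to-one assumption).
    """
    needed = []
    furthest = 0
    for src_index in alignment:
        # Once a source word has been heard it stays available, so the
        # requirement never decreases.
        furthest = max(furthest, src_index + 1)
        needed.append(furthest)
    return needed


# Toy SOV source (subject, object, verb) -> SVO target (subject, verb, object):
# the second target word (the verb) is the third source word.
sov_to_svo = [0, 2, 1]
# Toy pair where source and target share the same word order.
same_order = [0, 1, 2]

print(source_words_needed(sov_to_svo))  # [1, 3, 3]
print(source_words_needed(same_order))  # [1, 2, 3]
```

In the same-order case the translation can stay one word behind the speaker throughout. In the SOV to SVO case, the second target word already requires the whole three-word source clause.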

Approaches to tackle real-time text

Gu et al. (2017) presented an approach in which they use a typical RNN (Recurrent Neural Network) system and build an additional network, trained with reinforcement learning, to predict when to stop reading the source and translate the content read so far. Their method can be seen as chunking the source text into smaller units and translating them. The system was trained to optimize both the delay (the gap between source availability and target generation) and the quality of the translation.

Recently, Ma et al. (2018) presented two approaches for simultaneous translation: 1) based on an RNN model and 2) based on the Transformer model. Both systems are trained in such a way that they can start generating the translation after only the first few words, and then keep taking in new source words while generating target words, trying to mimic a simultaneous interpreter.

For both systems, the training differs from the usual training on full sentences. In full-sentence training, we get a hidden representation of the full source sentence and feed it to the decoder. In simultaneous translation, for the RNN, they encode only the first k+t words, and the system is trained to generate from this partial sentence. In the Transformer model, they train the network to attend only to a prefix (the first k+t words) of the sentence. Here t refers to the time step after the first k words: initially we have only k words (t=0), and as we progress we have access to more words until the sentence finishes.
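The prefix restriction can be pictured as a mask over source positions. The sketch below is our own illustration in plain Python, not the authors' code: the helper names `visible_prefix_length` and `prefix_mask` are hypothetical, and we assume one target word is produced per time step t.

```python
def visible_prefix_length(k, t, source_len):
    """Number of source words visible at step t (t = 0 for the first target word)."""
    return min(k + t, source_len)


def prefix_mask(k, source_len, target_len):
    """mask[t][j] is True if target step t may attend to source word j."""
    return [
        [j < visible_prefix_length(k, t, source_len) for j in range(source_len)]
        for t in range(target_len)
    ]


mask = prefix_mask(k=2, source_len=5, target_len=4)
for t, row in enumerate(mask):
    print(t, "".join("x" if allowed else "." for allowed in row))
# 0 xx...
# 1 xxx..
# 2 xxxx.
# 3 xxxxx
```

Each row is one target step. With k=2, the first target word sees two source words, and every later step sees one more until the source sentence is exhausted. Full-sentence training corresponds to every row being all `x`.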

They introduce a wait-k parameter to control how much delay we would like the system to have. The authors also claim that the system can, to a certain extent, predict the target word from the context even if the corresponding source word is not yet available. For example, in the case of SOV to SVO, the system will try to generate a verb in the target before seeing it in the source. However, because such prediction is hard, it will be error-prone, and in most cases it will be better to wait than to generate something wrong.
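The read/write schedule implied by wait-k can be sketched as a simple loop. This is a toy illustration under our own assumptions, not the system from the paper: `wait_k_decode` and `toy_translate_next` are hypothetical names, and the "model" is a word-for-word lexicon lookup standing in for a trained network.

```python
def wait_k_decode(source_stream, k, translate_next):
    """Read k source words first, then alternate: write one target word,
    read one more source word. Once the source ends, keep writing.

    translate_next(prefix, target_so_far) is a hypothetical stand-in for a
    trained model; it returns the next target word, or None when finished.
    Returns the target words and a log of (source words read, word written).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    source = iter(source_stream)
    prefix, target, log = [], [], []
    source_done = False
    while True:
        # READ until k + len(target) source words are available, or the
        # speaker has finished.
        while not source_done and len(prefix) < k + len(target):
            word = next(source, None)
            if word is None:
                source_done = True
            else:
                prefix.append(word)
        # WRITE one target word based only on what has been read so far.
        next_word = translate_next(prefix, target)
        if next_word is None:
            break
        target.append(next_word)
        log.append((len(prefix), next_word))
    return target, log


# Toy word-for-word "model" (an assumption for illustration only).
LEXICON = {"the": "la", "house": "casa", "is": "es", "big": "grande"}


def toy_translate_next(prefix, target_so_far):
    i = len(target_so_far)
    if i >= len(prefix):
        return None  # nothing left to translate
    return LEXICON.get(prefix[i], prefix[i])


target, log = wait_k_decode("the house is big".split(), 2, toy_translate_next)
print(target)  # ['la', 'casa', 'es', 'grande']
print(log)     # [(2, 'la'), (3, 'casa'), (4, 'es'), (4, 'grande')]
```

The log shows the delay: each target word is written when the reader is k words ahead, and the last words are flushed once the source has ended. A real model would also try to anticipate words it has not yet seen, which this lookup cannot do.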

How does it perform?

The figure below, from Ma et al. (2018), shows the BLEU scores for various values of the delay (k). The scores of the Transformer model are much better than those of the RNN. We can see that the scores are low, especially for the lower values of k. As we increase k, the scores improve until we reach the full-sentence translation (baseline). There is therefore a tradeoff between the quality we would like to have and the delay we can afford. The scores for Gu et al. (2017) (not shown in the figure) were slightly lower than the scores of the RNN model from Ma et al. (2018).

Figure: BLEU scores for "wait-k" models (Ma et al., 2018)

In summary 

From these initial results, we can conclude that there is a long way to go before we can use machine translation for automated simultaneous interpreting. However, with further improvements, for example adding more context from previously seen text (document-level NMT) and taking interpreters’ suggestions from a user interface / user experience perspective, such systems have the potential to complement interpreters' work.
Author

Dr. Rohit Gupta

Sr. Machine Translation Scientist