Issue #17 - Speeding up Neural MT

Raj Patel 15 Nov 2018
The topic of this blog post is speed/scaling.

Introduction

For all the benefits Neural MT has brought in terms of translation quality, producing output quickly and efficiently is still a challenge for developers. All things being equal, Neural MT is slower than its statistical counterpart. This is particularly the case when running translation on standard processors (CPUs), which is still common practice, rather than on the faster, more powerful (but also more expensive) graphics processors (GPUs).

The slowest component in the Neural MT process is searching for the most likely translation in the neural network, commonly known as decoding. In this article, we take a look at the decoding process of Neural MT and dive into a few recent pieces of work focused on speeding up this step.

Search Algorithms in Neural MT

For Neural MT, a widely used architecture is the attention-based encoder-decoder framework. The encoder encodes the source sentence into a representation (the context vector) and the decoder uses this to generate the target translation word by word. In the simplest form, the translation of a word is generated by drawing a probability distribution over the entire target vocabulary, using the encoded representation and the previously translated token. Given these probability distributions, the commonly used search strategies are greedy search and beam search.
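
To make this concrete, below is a minimal sketch (in Python with NumPy) of a single decoder step. The toy linear recurrence, the weights and the fixed context vector are illustrative assumptions; a real decoder uses recurrent or Transformer layers and recomputes the context via attention at every step.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, HID = 1000, 64                   # toy vocabulary and hidden sizes
    E = rng.normal(size=(VOCAB, HID))       # target-side embedding table
    W_out = rng.normal(size=(HID, VOCAB))   # output projection

    def decode_step(context, prev_token, state):
        """One decoder step: combine the encoder context, the previously
        translated token and the decoder state into a probability
        distribution over the whole target vocabulary."""
        state = np.tanh(state + context + E[prev_token])
        logits = state @ W_out              # one score per vocabulary item
        probs = np.exp(logits - logits.max())
        return probs / probs.sum(), state   # softmax normalises over VOCAB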

Greedy Search: In this approach we simply choose the candidate with the highest probability at each time step. As greedy search only explores a single path through the search space, it translates quickly, but not with the best translation quality.
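
A minimal greedy decoder built on such a step function might look as follows; the decode_step interface, the token ids and the step limit are illustrative assumptions rather than any toolkit's API.

    import numpy as np

    def greedy_decode(decode_step, context, state, bos=1, eos=2, max_len=50):
        """Follow the single most probable token at every step."""
        output, token = [], bos
        for _ in range(max_len):
            probs, state = decode_step(context, token, state)
            token = int(np.argmax(probs))   # keep only the best candidate
            if token == eos:
                break
            output.append(token)
        return output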

Beam Search: Beam search is widely used in Neural MT and usually improves translation quality by a good margin compared to greedy search. During beam search, the beam size decides how many words/hypotheses to keep in memory at each time step for exploring the future possibilities. In its simplest form, we can summarise beam search as follows (see the sketch after the list):

  1. The decoder network outputs the softmax probabilities and keeps the B most probable hypotheses in memory, where B is the beam size.
  2. Expand each hypothesis to the next beam of the same size and select the top B items by accumulated score (i.e., the product of the probabilities of the current and previous selections).
  3. Terminate the search once either the <eos> (end-of-sentence) symbol is generated or the number of translated tokens reaches the maximum-token threshold.
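
The sketch below implements these three steps. It accumulates log-probabilities, which is equivalent to multiplying the raw probabilities; as above, the decode_step interface and token ids are illustrative assumptions.

    import numpy as np

    def beam_search(decode_step, context, state, B=4, bos=1, eos=2, max_len=50):
        """Keep the B best partial hypotheses at every time step."""
        # Each hypothesis: (accumulated log-prob, token sequence, decoder state)
        beams, finished = [(0.0, [bos], state)], []
        for _ in range(max_len):
            candidates = []
            for score, seq, st in beams:
                probs, new_st = decode_step(context, seq[-1], st)
                for tok in np.argsort(probs)[-B:]:   # B locally best tokens
                    candidates.append((score + np.log(probs[tok]),
                                       seq + [int(tok)], new_st))
            # Keep the B globally best hypotheses (steps 1 and 2).
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams = []
            for cand in candidates[:B]:
                # Step 3: hypotheses ending in <eos> leave the beam.
                (finished if cand[1][-1] == eos else beams).append(cand)
            if not beams:                            # everything terminated
                break
        return max(finished + beams, key=lambda c: c[0])[1]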

Making Beam Search Faster

The main reason translation is slower in Neural MT is that generating each target word requires extensive computation. The second reason is that the softmax operation used to draw the probability distribution must be normalised over the entire target vocabulary, which typically contains tens of thousands of words. A great deal of work has been devoted to optimising and improving the decoding process of Neural MT.
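
As a back-of-the-envelope illustration, with assumed (but typical) sizes, the output projection alone costs millions of multiply-adds per hypothesis per step:

    hidden_dim, vocab = 512, 50_000          # assumed, but typical, sizes
    beam, avg_len = 4, 25

    # The output layer is a (hidden_dim x vocab) matrix-vector product,
    # and the softmax then normalises over all `vocab` scores.
    mults_per_token = hidden_dim * vocab     # 25.6M multiply-adds
    total = mults_per_token * beam * avg_len
    print(f"{total / 1e9:.2f}B multiply-adds")   # ~2.56B for one sentence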

Hu et al. (2015) used a constrained softmax with a priority queue to choose the best hypotheses for the next search step, which drastically reduces the search space. They also incorporated a strategy to build a limited word set of translation candidates, which greatly reduces the computational complexity. On GPUs, they achieved a speed-up of about 3.5x over a well-optimised baseline system without sacrificing translation quality.
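
The word-set idea can be approximated as in the sketch below: the softmax is computed only over translation candidates gathered for the source words, plus a short list of frequent target words. The lexicon format and the sizes are illustrative assumptions, not the exact construction of Hu et al.

    import numpy as np

    def candidate_vocab(source_tokens, lexicon, n_frequent=50, per_word=10):
        """Union of frequent target words and per-source-word candidates."""
        cands = set(range(n_frequent))       # assume ids 0..49 are frequent
        for s in source_tokens:
            cands.update(lexicon.get(s, [])[:per_word])
        return np.array(sorted(cands))

    def restricted_softmax(state, W_out, cands):
        """Normalise over the candidate set instead of the full vocabulary."""
        logits = state @ W_out[:, cands]     # skinny slice of the projection
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()           # probs[i] scores token cands[i]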

To prune the hypotheses, Wu et al. (2016) added a length normalisation term α and a coverage penalty β to the beam search decoder's scoring function. They reported a speed-up of 30%-40% in the search process when running on CPUs. Huang et al. (2018) proposed a faster beam search with modulo beam size (optimal beam search). They introduced an "optimality certificate" that terminates the search once it can guarantee that no future hypothesis will score better than the current best one.
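
Wu et al.'s length and coverage terms can be sketched directly from the equations in their paper; here the accumulated log-probability and the attention matrix are passed in as plain NumPy values.

    import numpy as np

    def rescore(log_prob, attn, alpha=0.2, beta=0.2):
        """Length-normalised, coverage-penalised hypothesis score,
        following the equations in Wu et al. (2016).

        log_prob -- accumulated log P(Y|X) of the hypothesis
        attn     -- attention weights, shape (target_len, source_len)
        """
        tgt_len = attn.shape[0]
        lp = (5 + tgt_len) ** alpha / (5 + 1) ** alpha   # length normalisation
        # Coverage: penalise source words the attention never fully covered.
        cp = beta * np.sum(np.log(np.minimum(attn.sum(axis=0), 1.0)))
        return log_prob / lp + cp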

Zhang et al. (2018) used cube pruning to speed up the decoding process. As part of cube pruning, they clustered similar target hidden states to construct equivalence classes. The clustering directly reduces the number of expansion operations needed to generate the next hidden states, and means fewer softmax operations over the target vocabulary. They reported a 3.3x speed-up on GPUs and 3.5x on CPUs with the same or even better translation quality.
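
A much-simplified sketch of the clustering idea: hypotheses whose decoder states are sufficiently similar are merged into one equivalence class, and the expensive projection and softmax run once per class representative. The greedy cosine-similarity clustering and the threshold are assumptions for illustration; the actual cube pruning machinery in Zhang et al. (2018) is more involved.

    import numpy as np

    def cluster_states(states, threshold=0.9):
        """Greedily group hidden states by cosine similarity, returning the
        representatives and, for each state, its cluster index."""
        reps, units, assign = [], [], []
        for h in states:
            u = h / np.linalg.norm(h)
            for i, r in enumerate(units):
                if u @ r > threshold:        # close enough: join cluster i
                    assign.append(i)
                    break
            else:
                assign.append(len(units))    # start a new equivalence class
                units.append(u)
                reps.append(h)
        return reps, assign

    # One softmax runs per representative instead of one per hypothesis;
    # hypothesis k then reads its distribution from its cluster's output.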

In summary

Speeding up the translation process in Neural MT has been a very active research topic of late. Traditional beam search is the de facto standard in commonly used NMT toolkits, but it is worth checking whether they also provide optimised implementations, as these optimisations not only speed up the translation process but can also improve quality.

Raj Patel
Machine Translation Scientist