Issue #32 - The Transformer Model: State-of-the-art Neural MT

Dr. Rohit Gupta 04 Apr 2019
The topic of this blog post is model improvement.


In this post, we will discuss the Transformer model (Vaswani et al. 2017), which is a state-of-the-art model for Neural MT. The Transformer model was published by the Google Brain and Google Research teams in June 2017 and has been a very popular architecture since then. It uses neither Recurrent Neural Networks (RNN) nor Convolutional Neural Networks (CNN). Instead, it uses attention mechanisms and feed-forward layers at various levels of the network, and the whole pipeline is trained end to end. What does that mean?

Training Engines

Like previous architectures for Neural MT, the Transformer is an autoregressive model based on an encoder-decoder architecture. The encoder consumes the input sentence and produces an encoded representation. The decoder generates the translation one token at a time, using the encoded representation and the translation generated so far.
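The encode-once, decode-token-by-token loop can be sketched in a few lines of Python. This is only an illustration: `greedy_decode`, `toy_encode` and `toy_next_token` are hypothetical names, and the toy "model" simply upper-cases the source words where a real system would run a trained network.

```python
from typing import Callable, List

def greedy_decode(
    encode: Callable[[List[str]], object],
    next_token: Callable[[object, List[str]], str],
    source: List[str],
    eos: str = "</s>",
    max_len: int = 50,
) -> List[str]:
    """Generate a translation one token at a time."""
    memory = encode(source)  # the encoder runs once over the source sentence
    output: List[str] = []
    for _ in range(max_len):
        # each step sees the encoded source and everything generated so far
        token = next_token(memory, output)
        if token == eos:
            break
        output.append(token)
    return output

# Toy stand-ins (assumptions for illustration, not a real MT model)
def toy_encode(source: List[str]) -> List[str]:
    return list(source)

def toy_next_token(memory: List[str], prefix: List[str]) -> str:
    if len(prefix) < len(memory):
        return memory[len(prefix)].upper()
    return "</s>"

print(greedy_decode(toy_encode, toy_next_token, ["das", "ist", "gut"]))
# ['DAS', 'IST', 'GUT']
```

The property to notice is that `next_token` receives the prefix generated so far, which is what makes the model autoregressive.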

What is Attention?

It gets difficult for a decoder to generate the translation by looking at the full encoded representation only once. When humans translate, they refer to the source sentence many times during the process (maybe even more than just the source sentence, see Issue #31). They also focus on different parts of the source sentence depending on the word or phrase being translated. To model such focus we have an 'attention mechanism' in Neural MT models. The attention mechanism provides additional information about which parts of the source are more relevant for the translation of the current token. 
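To make this concrete, here is a minimal sketch of attention for a single decoder step, using the scaled dot-product form from Vaswani et al. (2017). The function names and the toy vectors are assumptions chosen for illustration; in a real model the query, keys and values are learned representations.

```python
import math
from typing import List, Tuple

def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(
    query: List[float], keys: List[List[float]], values: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Return (attention weights over source positions, weighted sum of values)."""
    d_k = len(query)
    # relevance of each source position to the current target token
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy example: three source positions, one decoder query
source_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
source_values = [[10.0], [20.0], [30.0]]
weights, context = attend([4.0, 0.0], source_keys, source_values)
print([round(w, 3) for w in weights], context)
```

The weights say how much each source position contributes to the current token; here the first and third positions match the query best and receive most of the weight.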

RNN- and CNN-based models use their respective architectures for the encoder and decoder layers and add an attention mechanism to the decoder to model focus. The Transformer instead relies on attention in all parts of the network: it also uses attention inside the encoder and decoder layers, which results in an architecture that does not need RNNs or CNNs. For this reason, the title of the Transformer model paper is “Attention is all you need”.

Self Attention

In the Transformer model, the encoder is trained by attending to other words in the same sentence. The decoder is trained by attending to the words it has already generated and to the encodings of the source. Because the model is autoregressive and depends on the previously generated words, it must not attend to invalid or future positions, so such connections are masked in the decoder. Because the words being attended to come from the same place (encoded words when training the encoder, or decoded words when training the decoder), the technique is called self-attention.
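The masking step can be sketched as follows. This is a simplified illustration with hypothetical helper names: disallowed positions get a score of minus infinity before the softmax, so they end up with exactly zero weight.

```python
import math
from typing import List

def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over the allowed positions only; masked positions get weight 0."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because every position may attend to itself
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
for row in mask:
    print(masked_softmax([1.0, 1.0, 1.0], row))
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# then roughly one third each for the last position
```

The first decoder position can only see itself, the second sees the first two, and so on, which mirrors what is available at translation time.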

Multi-Head Attention

Instead of using a single attention layer of size 512, the authors found empirically that using several smaller attention layers in parallel helps in training. A base Transformer model therefore uses eight attention layers (each of size 64) in parallel and concatenates their outputs to form a multi-head attention layer (of size 512). This lets the model focus on a sentence in eight different ways at once, one per head.
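The size bookkeeping (eight heads of size 64 concatenated back to 512) can be checked with a small pure-Python sketch. It rests on several assumptions: random matrices stand in for learned projection weights, the helper names are hypothetical, and the final output projection used in the paper is left out for brevity.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m), both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(Q[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

def multi_head_self_attention(X, heads):
    """Run every head on the same input X and concatenate the results per token."""
    per_head = [attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
                for Wq, Wk, Wv in heads]
    return [[v for head in per_head for v in head[t]] for t in range(len(X))]

d_model, n_heads = 512, 8
d_head = d_model // n_heads  # 64

rng = random.Random(0)  # random weights stand in for learned parameters

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.05) for _ in range(cols)] for _ in range(rows)]

X = random_matrix(3, d_model)  # a toy "sentence" of 3 token vectors
heads = [tuple(random_matrix(d_model, d_head) for _ in range(3))
         for _ in range(n_heads)]

out = multi_head_self_attention(X, heads)
print(d_head, len(out), len(out[0]))  # 64 3 512
```

Each head projects the 512-dimensional input down to 64 dimensions with its own weights, so the heads can learn to attend differently while the concatenated output keeps the original size.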

Feed-forward Layer

To increase the capacity of the model, each layer also contains a position-wise fully connected feed-forward layer in addition to the attention layers. Therefore, one layer in the encoder is composed of two sub-layers:
  1. multi-head self-attention
  2. feed-forward layer
One layer in the decoder is composed of three sub-layers:
  1. masked multi-head self-attention
  2. multi-head attention over the encoder output, which models focus on the source
  3. feed-forward layer
The following diagram from Vaswani et al. (2017) shows the encoder (left, containing two sub-layers) and the decoder (right, containing three sub-layers).

[Figure: The Transformer - model architecture]
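"Position-wise" means that the same small network is applied to every position of the sentence independently. Below is a minimal sketch, assuming the two-layer form with a ReLU in between that the paper describes; the toy sizes and random weights are assumptions for illustration (the base model uses 512 and 2048).

```python
import random

def linear(x, W, b):
    """Compute x @ W + b for a single vector x; W has shape (len(x), len(b))."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to one position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

def position_wise(sequence, params):
    """Apply the same feed-forward network to every position separately."""
    return [feed_forward(x, *params) for x in sequence]

d_model, d_ff = 4, 16  # toy sizes for readability
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

params = (rand_matrix(d_model, d_ff), [0.0] * d_ff,
          rand_matrix(d_ff, d_model), [0.0] * d_model)

token = [1.0, -2.0, 0.5, 3.0]
sequence = [token, [0.0, 1.0, 0.0, -1.0], token]
out = position_wise(sequence, params)
print(len(out), len(out[0]))  # 3 4
print(out[0] == out[2])       # True
```

Identical inputs at different positions give identical outputs because the feed-forward layer never looks at neighbouring words; mixing information across positions is left to the attention sub-layers.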

How does it perform?

On English-German, the Transformer big-model achieved 28.4 BLEU and the Transformer base-model achieved 27.3 BLEU, compared to 26.36 by a CNN ensemble system, 25.16 by a CNN system and 24.6 by an RNN system. 

On English-French, the Transformer big-model achieved 41.8 BLEU and the base-model achieved 38.1 BLEU, compared to 41.29 by a CNN ensemble system, 40.46 by a CNN system and 39.92 by an RNN system. 

Therefore, in terms of quality, the model achieved better results on English-German and comparable results on English-French. In terms of training cost, the Transformer base-model was significantly cheaper to train. E.g. training the English-German base-model took 3.3x10^18 FLOPs, compared to 9.6x10^18 for the CNN system (about three times as much computation) and 2.3x10^19 for the RNN system (about seven times as much).
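The ratios quoted above follow directly from the reported training costs, as this quick check shows (the dictionary labels are just shorthand for the systems named in the text):

```python
# Training cost in floating-point operations, as reported above
training_flops = {
    "Transformer (base)": 3.3e18,
    "CNN": 9.6e18,
    "RNN": 2.3e19,
}

baseline = training_flops["Transformer (base)"]
ratios = {name: cost / baseline for name, cost in training_flops.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
# Transformer (base): 1.0x
# CNN: 2.9x
# RNN: 7.0x
```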

Recently, Lakew et al. (2018) compared RNNs and Transformer models on various neural MT tasks and found that the Transformer model consistently produces better translations than RNN.

In summary

Due to its translation quality and training efficiency, the Transformer model is being used extensively in research and industry. Many state-of-the-art approaches like BERT and XLM - which we covered in Issue #24 and Issue #25 respectively - are built upon the Transformer model. This goes to show how each new door opened during research in this area leads to many new avenues to explore, hence the relentless pace of development we're seeing!

Dr. Rohit Gupta

Sr. Machine Translation Scientist