Introduction
The Transformer is a state-of-the-art Neural MT model, as we covered previously in Issue #32. So what happens when something works well with neural networks? We try to go wider and deeper! There are two research directions that look promising for enhancing the Transformer model: building wider networks by increasing the size of the word representation and attention vectors, or building deeper networks (i.e. with more encoder and decoder layers). In this post, we take a look at a paper by Wang et al. (2019) which proposes a deep Transformer architecture that overcomes the well-known difficulties of this approach.
Problems of wide and deep Transformer networks
Wide Transformer networks (the so-called Transformer-big) are a common choice when a large amount of training data is available. However, they contain more parameters, which slows both training and generation (roughly 3 times slower than the so-called Transformer-base, which offers a reasonable trade-off between quality and efficiency).
The usual problems of deep Transformer networks are vanishing gradients and information forgetting. Vanishing gradients occur because, in order to update the network weights, the gradient of the loss function is propagated backwards through the layers by repeated multiplication (the chain rule); when those factors are small, the product shrinks towards zero after a number of layers, and the lower layers barely learn. Information forgetting occurs when the information flow has to pass through a large number of layers: information that was present in the first layers may be lost by the time it reaches the last ones.
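To make the vanishing gradient intuition concrete, here is a toy NumPy illustration (not from the paper): back-propagation multiplies the gradient by one Jacobian per layer, and with small factors the product shrinks exponentially with depth. The dimensions and scale below are arbitrary, chosen only to show the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.ones(64)            # initial gradient, norm = 8.0

for _ in range(30):           # 30 layers, as in the deep model discussed below
    # Each layer contributes a Jacobian; small weights mean each
    # multiplication shrinks the gradient norm on average.
    jacobian = rng.normal(scale=0.05, size=(64, 64))
    grad = jacobian.T @ grad

# The norm is now many orders of magnitude below the initial 8.0:
# the learning signal for the lowest layers has effectively vanished.
print(np.linalg.norm(grad))
```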
Pre-normalisation
Layer normalisation is a common feature of neural networks, aiming at reducing the variance of sub-layer outputs to make training more stable and convergence faster. Transformer networks are composed of residual units, in which the output of a sub-layer is the sum of its input and the result of a function applied to that input. Usually normalisation is applied to this sum (post-normalisation), but Wang et al. instead apply it to the input of the function (pre-normalisation), which keeps a direct path for the gradient through the residual connections and alleviates the vanishing gradient issue.
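As a minimal PyTorch sketch of the idea (the class and variable names here are illustrative, not the paper's actual code), a pre-normalised residual unit normalises the input before the sub-layer function, rather than normalising the sum afterwards:

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Residual unit with pre-normalisation: x + F(LayerNorm(x)).

    `sublayer` stands in for self-attention or the feed-forward block.
    """
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # Post-norm (original Transformer): LayerNorm(x + sublayer(x))
        # Pre-norm (Wang et al.):           x + sublayer(LayerNorm(x))
        return x + self.sublayer(self.norm(x))

# Example: wrap a feed-forward sub-layer with pre-normalisation.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = PreNormResidual(512, ffn)
out = block(torch.randn(2, 10, 512))   # (batch, sequence, d_model)
```

Because the identity path x is never normalised away, the gradient can flow straight through the stack of residual connections, which is what makes very deep stacks trainable.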
Dynamic Linear Combination of Layers
Wang et al. express the input of a layer as a learned linear combination of the outputs of all previous layers, with the combination weights trained alongside the rest of the model. In this way, the model makes direct links with all previous layers and offers efficient access to lower-level representations in a deep stack. This approach reduces the information forgetting issue.
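The sketch below shows the general shape of such a combination in PyTorch. It is simplified relative to the paper (one scalar weight per earlier layer, names like DynamicLinearCombination are our own), but it illustrates how the input of layer l+1 can draw directly on every earlier layer's output:

```python
import torch
import torch.nn as nn

class DynamicLinearCombination(nn.Module):
    """Input of layer l+1 = learned weighted sum of outputs of layers 0..l."""
    def __init__(self, num_layers, d_model):
        super().__init__()
        # One LayerNorm per stored output (embedding + each layer).
        self.norms = nn.ModuleList(
            [nn.LayerNorm(d_model) for _ in range(num_layers + 1)]
        )
        # weights[l][k]: how much of layer k feeds the input of layer l+1,
        # initialised to a uniform average and learned during training.
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.full((l + 1,), 1.0 / (l + 1)))
             for l in range(num_layers + 1)]
        )

    def combine(self, layer_outputs):
        """layer_outputs: list of tensors [embedding, layer 1 output, ...]."""
        l = len(layer_outputs) - 1
        normed = [self.norms[k](y) for k, y in enumerate(layer_outputs)]
        return sum(w * y for w, y in zip(self.weights[l], normed))

# Usage inside an encoder loop (illustrative):
#   outputs = [embeddings]
#   for layer in encoder_layers:
#       x = dlcl.combine(outputs)   # direct access to all earlier layers
#       outputs.append(layer(x))
```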
Results
Thanks to pre-normalisation and the dynamic linear combination of layers, Wang et al. manage to train deep Transformers with up to 30 layers (versus 6 for Transformer-big). Training time is 3 times lower than for Transformer-big, and generation is 10% faster. The model is 33% smaller in GPU memory than Transformer-big. Depending on the task, they obtain BLEU scores similar to or slightly better than Transformer-big. Scores keep increasing with the number of layers up to 30 layers, after which they decrease.
In summary
Although the common way of enhancing Transformer translation quality is to use wider networks, this is not easily applicable in production because of the slower generation speed and the memory required in multi-threaded settings. The research reviewed in this post shows that deeper networks achieve improvements over Transformer-base in automated translation metrics comparable to those of Transformer-big, with smaller models and slightly faster generation than Transformer-big.