Issue #58 - Quantisation of Neural Machine Translation models
Introduction
When large amounts of training data are available, the quality of Neural MT engines increases with the size of the model. However, larger models mean decoding with more parameters, which makes the engine slower at test time. Improving the trade-off between model compactness and translation quality is an active research topic. One way to obtain more compact models is quantisation, that is, requiring each parameter value to occupy a fixed, smaller number of bits, thus limiting the computational cost. In this post we take a look at a paper which obtains Transformer Neural MT models that are four times more compact via quantisation into 8-bit values, with no loss in translation quality as measured by BLEU score.
Method
Prato et al. (2019) propose to quantise all operations for which quantisation provides a computational speed gain at test time. The method consists of a function that assigns to a parameter value an integer between 0 and 255 (8 bits), corresponding to where this value lies between the minimum and the maximum values taken by that parameter. For example, if the current value is halfway between the minimum and the maximum, the quantised value will be 127. The minimum and maximum of each parameter are tracked during training and updated at each forward pass.
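To make the mapping concrete, here is a minimal sketch in Python/NumPy of uniform 8-bit quantisation given a parameter's tracked minimum and maximum. This is not the authors' implementation; the function names and the small epsilon guard against a zero range are our own additions.

```python
import numpy as np

def quantise(x, x_min, x_max, num_bits=8):
    """Uniformly map values in [x_min, x_max] to integers in [0, 2**num_bits - 1]."""
    levels = 2 ** num_bits - 1                          # 255 for 8 bits
    scale = max((x_max - x_min) / levels, 1e-8)         # width of one quantisation step
    q = np.round((np.clip(x, x_min, x_max) - x_min) / scale)
    return q.astype(np.uint8)

def dequantise(q, x_min, x_max, num_bits=8):
    """Recover an approximate real value from the quantised integer."""
    levels = 2 ** num_bits - 1
    scale = max((x_max - x_min) / levels, 1e-8)
    return q.astype(np.float32) * scale + x_min

# Example: quantise a weight matrix with its own min/max, then reconstruct it.
w = np.random.randn(4, 4).astype(np.float32)
w_q = quantise(w, w.min(), w.max())
w_hat = dequantise(w_q, w.min(), w.max())
```

A value exactly halfway between the minimum and the maximum maps to round(127.5), illustrating the "halfway gives roughly 127" example above.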
Because the quantisation function has to be applied at each operation where it is considered useful, training is about twice as slow as without quantisation. Alternatively, the minimum and maximum parameter values can be computed during a few hundred additional forward passes after training has converged (so-called post-quantisation), which is very fast.
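The sketch below illustrates how such ranges could be tracked over forward passes and then frozen for post-quantisation calibration. The class name, the exponential-moving-average update rule and the number of calibration batches are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

class RunningRange:
    """Tracks a running minimum and maximum of a tensor across forward passes."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.min = None
        self.max = None

    def update(self, x):
        batch_min, batch_max = float(x.min()), float(x.max())
        if self.min is None:                     # first observation
            self.min, self.max = batch_min, batch_max
        else:                                    # smooth the range over batches (assumed EMA rule)
            self.min = self.momentum * self.min + (1 - self.momentum) * batch_min
            self.max = self.momentum * self.max + (1 - self.momentum) * batch_max
        return self.min, self.max

# Post-quantisation calibration: run a few hundred batches through the trained
# model, update the ranges, then freeze them and quantise with those ranges.
tracker = RunningRange()
for _ in range(300):
    activations = np.random.randn(64, 512).astype(np.float32)  # stand-in for real activations
    tracker.update(activations)
```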
Results
The authors present results on two language pairs, English to German (en-de) and English to French (en-fr), and two baseline model sizes: the so-called base Transformer, with a representation vector dimension of 512 and 8 attention heads, and the so-called big Transformer, which is twice as large.
Surprisingly, full quantisation yields an improvement in BLEU score in most cases (1.1 points for en-de with both the base and big Transformers, and 0.6 points for en-fr with the base Transformer). For en-fr with the big Transformer, the BLEU score is preserved under quantisation. However, fully quantising the Transformer seems to result in higher perplexity (which measures how difficult it is for the model to make predictions on a test set) even while translation accuracy increases. The reason might be that the lower numerical precision acts as a regulariser, preventing the model from over-fitting on smaller data sets (which is the case for the en-de task).
With post-quantisation, which has a negligible training cost, a small decrease in BLEU score (-0.3 points) is observed in most cases, except for en-fr with the base Transformer, where it is unchanged. If post-quantisation allows a big Transformer model to be used in production instead of a base one, which usually yields an improvement of 1 to 2 BLEU points, then a loss of 0.3 points is acceptable.