Issue #1 - Scaling Neural MT
Training a neural machine translation engine is a time-consuming task. It typically takes a number of days or even weeks, even when running on powerful GPUs. Reducing this time is a priority for any neural MT developer. In this post we explore recent work by Ott et al. (2018), in which they speed up training 4.9 times on a single machine, and 38.6 times using sixteen such machines, without compromising translation quality.
The key ingredients of this training procedure are half-precision floating point (FP16), Volta GPUs, and distributed synchronous SGD (stochastic gradient descent).
Half-precision floating point (FP16)
Floating-point formats are how computers represent real numbers in memory. FP16 requires half the storage space and half the memory bandwidth of single-precision floating point (FP32), so FP16 computation can be faster on some machines. The trade-off is that FP16 has lower precision and a smaller range than FP32. In general, FP32 has enough capacity to carry out the computations required in neural networks and is therefore the usual choice. FP16 in a vanilla setting, on the other hand, has the disadvantage that gradients can underflow or overflow because of the limited numerical range, which hinders training. One technique used to overcome this is loss scaling (Narang et al., 2018). In neural networks, most of the values are very small and have large negative exponents. To make better use of the FP16 range, we therefore scale the small values up as needed before computing the network parameters, and once the computations are done we scale back down by the same factor.
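The effect of loss scaling can be illustrated with a minimal sketch that uses only the Python standard library. The helper `to_fp16`, the value of `LOSS_SCALE`, and the example gradient are assumptions made for illustration; they are not taken from Narang et al. (2018) or Ott et al. (2018).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; model that as infinity.
        return float("inf") if x > 0 else float("-inf")


LOSS_SCALE = 1024.0  # hypothetical scaling factor (a power of two)
true_grad = 1e-8     # hypothetical gradient, too small for FP16

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled = to_fp16(true_grad)

# With scaling, the value survives the FP16 round trip; afterwards we
# divide by the same factor in higher precision to recover the gradient.
scaled = to_fp16(true_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8, up to FP16 rounding error
print(to_fp16(70000.0))  # inf: values above the FP16 range overflow
```

The same helper also shows the other failure mode: a scaling factor that is too large pushes values past the FP16 maximum, which is why the factor has to be chosen with care.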
Volta GPUs
Newer GPU architectures support FP16 and allow conversion between FP16 and FP32 during computation. Before and after any computation, the neural network uses FP16. However, when the network requires FP32 capacity during a computation, for example when accumulating vector dot products, it uses FP32 and converts the result back to FP16. Nvidia Volta GPUs support such a scheme. With this architecture support and the scaling described above, we can train a neural network in FP16 without compromising quality. Narang et al. (2018) conducted various experiments in such settings, including machine translation, speech recognition and language modelling, and obtained results on par with FP32.
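Why the accumulation step needs more than FP16 can be seen in a small simulation. This is only a sketch in plain Python, not GPU code: the helper `to_fp16` and the vector length are assumptions for illustration, and an ordinary Python float (double precision) stands in for the FP32 accumulator.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


n = 4096  # hypothetical vector length
a = [to_fp16(1.0)] * n
b = [to_fp16(1.0)] * n

# Pure FP16: every partial sum is rounded back to FP16.
acc16 = 0.0
for x, y in zip(a, b):
    acc16 = to_fp16(acc16 + to_fp16(x * y))

# Mixed precision: FP16 inputs, wider accumulator, one final conversion.
acc_wide = 0.0
for x, y in zip(a, b):
    acc_wide += x * y
mixed = to_fp16(acc_wide)

print(acc16)  # 2048.0: the FP16 sum stalls once adding 1.0 no longer changes it
print(mixed)  # 4096.0: the exact dot product, still representable in FP16
```

Above 2048, neighbouring FP16 values are 2 apart, so adding 1.0 is rounded away and the running sum stops growing. Accumulating in a wider format avoids this, and the final result still fits in FP16.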
Distributed Synchronous SGD
Synchronous SGD is a technique that allows us to parallelise training across multiple GPUs using data parallelism. We divide a batch of data into smaller batches and put each small batch on a single GPU. After every GPU finishes its computation, we treat all the computations as one batch and update the parameters. Because we consider all the small batches before making an update, the effective batch size is larger. A bigger batch size may need more epochs (passes over the training data) to converge, but it helps training converge faster overall because fewer weight updates are needed (Hoffer et al., 2017). One drawback of synchronous SGD is that it can slow down the update procedure if some of the GPUs are slow, because it waits for every GPU to finish before making an update. To alleviate this, if we originally divided our data into K small batches, we wait only for the N fastest GPUs to finish and discard the remaining M batches when they arrive (K = N + M). An optimal value of M can be somewhere between 4% and 10% of K (Chen et al., 2016).
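The update rule can be sketched on a toy problem, with each "GPU" simulated as a list of examples. Everything here is a hypothetical illustration: the function names, the one-parameter model y = w * x, the learning rate and the finishing order are assumptions, and a real system would compute the shard gradients in parallel and combine them with a collective operation.

```python
def shard_gradient(w, shard):
    """Summed squared-error gradient for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)


def sync_sgd_step(w, shards, lr, finish_order=None, n_keep=None):
    """One synchronous update: combine worker gradients, then update once.

    finish_order lists shard indices in the order the workers finished.
    Only the first n_keep are used; the rest are discarded as stragglers.
    """
    order = list(range(len(shards))) if finish_order is None else finish_order
    kept = order if n_keep is None else order[:n_keep]
    total_grad = sum(shard_gradient(w, shards[i]) for i in kept)
    n_examples = sum(len(shards[i]) for i in kept)
    return w - lr * total_grad / n_examples


# Toy data generated from y = 3x, split across K = 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
K = 4
shards = [data[i::K] for i in range(K)]

# Waiting for all K workers gives the same update as one large batch.
w_sharded = sync_sgd_step(0.0, shards, lr=0.01)
w_single = sync_sgd_step(0.0, [data], lr=0.01)

# Backup-worker variant: keep the N = 3 fastest, drop M = 1 straggler.
w_backup = sync_sgd_step(0.0, shards, lr=0.01,
                         finish_order=[2, 0, 3, 1], n_keep=3)

print(w_sharded, w_single, w_backup)
```

In this sketch, dropping the straggler changes the update slightly, because one shard's examples are missing from the average, but it no longer has to wait for the slowest worker.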
Ott et al. (2018) obtain a 2.9-fold speedup over FP32 using FP16 and synchronous SGD on a machine with eight Nvidia V100 GPUs. They also observed that, even with only one machine available, we can process 16 batches before making a single large update with twice the learning rate. In this way, they further improved the speedup to 4.6 times over FP32. Because FP16 consumes less memory, they were also able to increase the batch size on the same machine, obtaining an overall 4.9-fold improvement in training speed. In experiments with 16 such machines (128 GPUs), they obtained a 38.6-fold improvement compared to FP32 on one machine (8 GPUs).
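The idea of processing several batches before a single update can be sketched as gradient accumulation. This is a minimal toy version, not the authors' implementation: the function names, the one-parameter model, the data and `base_lr` are all hypothetical. Only the figures of 16 batches and a doubled learning rate come from the text above.

```python
def batch_gradient(w, batch):
    """Mean squared-error gradient for the model y = w * x on one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(w, batches, lr, accumulate=16):
    """Apply one update per `accumulate` batches instead of one per batch.

    Leftover batches at the end (fewer than `accumulate`) are ignored here
    to keep the sketch short.
    """
    grad_sum, seen, updates = 0.0, 0, 0
    for batch in batches:
        grad_sum += batch_gradient(w, batch)
        seen += 1
        if seen == accumulate:
            w -= lr * grad_sum / accumulate  # one large update
            grad_sum, seen = 0.0, 0
            updates += 1
    return w, updates


# 32 hypothetical toy batches drawn from y = 3x.
batches = [[(0.1 * (i % 5 + 1), 0.3 * (i % 5 + 1))] * 2 for i in range(32)]

base_lr = 0.1  # hypothetical per-batch learning rate
w, updates = train_with_accumulation(0.0, batches, lr=2 * base_lr,
                                     accumulate=16)

print(updates)  # 2: only two parameter updates for 32 batches
print(w)        # has moved from 0.0 towards the true value 3.0
```

Fewer updates mean fewer synchronisation points, which is where the time saving comes from in a multi-GPU setting; the larger effective batch is what makes the higher learning rate plausible.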
In this way, with a 38.6-fold increase in speed, they managed to train an English-German engine while reducing the training time of an NMT system from days or weeks to a single-digit number of hours!