Issue #121 - Finding the Optimal Vocabulary Size for Neural Machine Translation
Sennrich et al. (2016) introduced a variant of byte pair encoding (Gage, 1994) for word segmentation, which is capable of encoding open vocabularies with a compact symbol vocabulary of variable-length subword units. With the use of BPE, the Neural Machine Translation (NMT) systems are capable of open-vocabulary translation by representing rare and unseen words as a sequence of subword units.
Today, subword tokenisation schemes inspired by BPE have become the norm across many Natural Language Processing tasks. The BPE algorithm has a single hyperparameter - “number of merge operations” - that governs the vocabulary size. According to Salesky et al., 2018, the effectiveness of this hyperparameter is not well understood and is most often chosen arbitrarily or by trial and error approach.
In today’s blog post, we will look at the work of Gowda and May, 2020. This paper attempts to answer the question ‘What value of BPE vocabulary size is best for NMT?’ along with an explanation for ‘Why that value?’. The authors also propose a simple heuristic approach for choosing the BPE hyperparameter.
Re-envisioned Simplified view of NMT architecture
Word types in natural language models resemble a Zipfian distribution, i.e. few types are extremely frequent, while most of the rest lie on the long tail of infrequency. The Zipfian distribution leads to 2 problems in classifier-based NLG tasks - unseen vocabulary and imbalanced classes.
Machine translation is commonly defined as the task of transforming sequences from the form x = x1x2x3...xm to y = y1y2y3...yn, where x ϵ X (source language) and y ϵ Y (target language). There are many variations of the NMT architecture and they are commonly viewed as encoder-decoder networks, but in this work, the authors re-envision the NMT architecture as two higher-level components: an auto-regressor (R) and a token classifier (C).
The authors cast neural machine translation (example of natural language generation task) as a form of the structured classification task, where an instance label (a translation) is generated as a sequence of contextualised labels, by an auto-regressor. There are some factors that affect the desired settings for the classifier and auto-regressor :
1. Divergence (D) - It is defined as the measure of divergence from a uniform distribution. Class imbalance leads to bias based on class frequencies. Specifically, classification learning algorithms focus on frequent classes while paying relatively less importance to infrequent classes. This frequency-based bias leads to poor recall of infrequent classes (Johnson and Khoshgoftaar, 2019).
A lower value of D signifies that the classes are in a uniform distribution, thus the classifier C is less likely to make errors due to bias.
2. Frequency at 95th percentile Class Rank (F95%) - The least frequency in the 95th percentile of most frequent classes. ML methods are known to perform better when there are many examples for each class. A higher value for F95% is the desired setting for C, as a higher value indicates the presence of many training examples per class.
3. Mean Sequence Length (µ) - µ represents the arithmetic mean of the lengths of target language sequences after encoding them. In the case of auto-regressive models, the total error accumulated grows in proportion to the length of the sequence. These accumulated errors alter the prediction of subsequent tokens in the sequence. Even though beam search attempts to mitigate this, it does not completely resolve it.
Since shorter sequences have relatively fewer places where an auto-regressor model can make errors, a smaller µ is a desired setting for R.
Choosing the BPE Vocabulary Size systematically
BPE is a greedy and deterministic algorithm that merges the most frequently occurring character or character sequences iteratively. The final symbol vocabulary size is equal to the size of the initial vocabulary, plus the number of merge operations– as mentioned earlier in this post, the latter is the only hyperparameter of the algorithm. Once the subword vocabulary is learned, it can be applied to a corpus by greedily segmenting words with the longest available subword type. These operations have an effect on D, F95%, and µ.
Effect of BPE on µ: As compared to simple white-space segmentation, BPE extends rare words into two or more subwords resulting in an increase in the sentence length ( increase in µ). BPE also combines frequent-character sequences into a single subword, shortening a sequence in comparison to character segmentation (decrease in µ). Effect of BPE on F95% and D: Since the BPE algorithm involves merging of frequent subwords and splitting of rare words into relatively frequent subwords, the class distribution itself is altered by moving the probability mass of classes. Hence, by altering the class distribution, BPE also alters both F95% and D for a corpus. From Figure 2, it can be seen that the optimal BPE vocabulary size is required to achieve the right tradeoff between the classifier and the auto-regressor.
ExperimentsThe experiments are carried out in 4 language directions - English (EN) -> German (DE), German -> English, English -> Hindi (HI), and English -> Lithuanian (LT) at various training data sizes. All the NMT experiments are carried out using the base transformer model (Vaswani et al.,2017). In order to analyse the impact of vocabulary size hyperparameter, experiments are carried out with a range of vocabulary sizes between 500 and 64K types.
Results and Analysis
- A vocabulary of 32K or larger is unlikely to produce optimal results unless the data set is large e.g. the 4.5M DE->EN sets.
- Small vocabularies seem to be sub-optimal. This could be due to the tradeoff between classifier and auto-regressor depending on the F95% value. A larger value of F95% is favorable to the classifier but not to the auto-regressor.
- Larger vocabularies are expected to achieve the best evaluation results. But this doesn’t hold true. The larger vocabularies (32K and beyond) have a smaller µ which favors the auto-regressor but due to lower F95% and a higher D unfavorable to the classifier, the trade-off doesn’t result in optimal results.
- On small (30K) to medium (1.3M) data sizes, the vocabulary size of 8K seems to find a good trade-off between µ and D, as well as between µ and F95%.
- For Transformer NMT, the heuristic is to use the largest possible BPE vocabulary such that at least 95% of classes have 100 or more examples in training.