Issue #12 - Character-based Neural MT
Most flavours of Machine Translation naturally use the word as the basis for learning models. Early work on Neural MT that followed this approach had to limit the vocabulary size for practical reasons, which created problems when dealing with out-of-vocabulary words. One approach explored to solve this problem was character-based Neural MT. With the emergence of subword approaches, which almost solve the out-of-vocabulary issue, interest in character-based models declined. However, there has been renewed interest recently, with some papers showing that character-based NMT may be a promising research avenue, especially in low-resource conditions.
Characters and Sub-words
The results obtained by Cherry et al. (2018) are straightforward to apply to any NMT setting because (unlike in most character-based models) they used the same engine as for (sub)word-based NMT, without adapting the architecture to the character-based scenario. Translating characters instead of subwords improves generalisation and simplifies the model through a dramatic reduction of the vocabulary. However, it also implies dealing with much longer sequences, which presents significant modelling and computational challenges for sequence-to-sequence neural models.
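To make the trade-off between vocabulary size and sequence length concrete, here is a minimal sketch in Python. It is not the segmentation used by Cherry et al. (2018): the subword inventory is a hand-picked, hypothetical one, and the greedy longest-match segmenter is only a simple stand-in for a trained subword model. The function names and the example sentence are likewise assumptions made for illustration.

```python
from typing import List, Set


def char_tokens(sentence: str, space_symbol: str = "_") -> List[str]:
    """Split a sentence into single characters, marking spaces explicitly."""
    return [space_symbol if ch == " " else ch for ch in sentence]


def greedy_subword_tokens(sentence: str, inventory: Set[str]) -> List[str]:
    """Segment each word by greedy longest match against a subword inventory.

    Any character that cannot be covered by a longer piece is emitted on its
    own, so the segmentation always succeeds.
    """
    tokens: List[str] = []
    for word in sentence.split():
        start = 0
        while start < len(word):
            # Try the longest candidate first, shrinking down to one character.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in inventory or end == start + 1:
                    tokens.append(piece)
                    start = end
                    break
    return tokens


# Hypothetical subword inventory, chosen by hand for this example only.
INVENTORY = {"the", "translat", "ion", "of", "un", "able", "words"}
SENTENCE = "the translation of untranslatable words"

subwords = greedy_subword_tokens(SENTENCE, INVENTORY)
chars = char_tokens(SENTENCE)

print(subwords)       # the subword segmentation of the sentence
print(len(subwords))  # number of subword tokens the model must process
print(len(chars))     # number of character tokens for the same sentence
print(len(set(chars)), "distinct character symbols")
```

In this toy case the character sequence is several times longer than the subword sequence, while the set of distinct symbols stays tiny and no word can ever be out of vocabulary.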
Cherry et al. (2018) compare subword-based and character-based translation with a standard recurrent neural network (RNN) on four language directions: English into French, and German, Czech and Finnish into English. They observe consistent BLEU improvements with the character-based engine over the subword-based engine in all four directions, provided the RNN is deep enough. They use a deeper RNN than is common practice, with 6 bidirectional layers in the encoder (which generates hidden states from the source vectors) and 8 unidirectional layers in the decoder (which generates target vectors from the hidden states). They argue that previous research in which character-based models were not competitive with (sub)word-based models used shallower networks. They also show that the advantage of the character-based model shrinks as the number of layers decreases.
The bigger, the better?
Interestingly, the gains achieved with the character-based model appear to decrease as the corpus size increases. The results suggest that in the English-French direction, the advantage of the character-based model would disappear after 60-70 million sentence pairs. Intuitively, however, this threshold could be higher for translation into morphologically richer languages. Of course, this is a massive data set, so the limit might not be relevant for many custom MT deployments.
The qualitative analysis of errors committed by both approaches reveals some interesting figures. According to this manual analysis of 100 translations from German into English, the character-based engine appears to be more faithful to the source, sometimes to the point of being too literal. A side-effect of this is the absence of dropped content, a well-known issue in NMT: the subword-based engine left source content untranslated in 7 sentences, whereas this does not occur in any of the character-based outputs. Another benefit of the character-based model is better lexical choice (8 errors in this category compared to 19 with the subword-based model), including better handling of German compounds (1 error versus 13).
When all parameters are equal, character-based models have a significant cost in terms of training speed. With the number of layers used in the paper, training the character-based model takes 8 times longer. However, given the drop in vocabulary size, some parameters could be reduced, making the difference less dramatic.
Although character-based models present computational and modelling challenges, these results suggest that they are an interesting avenue to explore, especially in a low-resource scenario or when translating into highly inflected languages. Character-based models could also be a solution to issues like under- or over-generation in NMT.