Monolingual language models were a critical part of Phrase-based Statistical Machine Translation systems. They are also used in unsupervised Neural MT systems (unsupervised means that no parallel data is available to supervise training; in other words, only monolingual data is used). However, they are not used in standard supervised Neural MT engines, and training language models has disappeared from common NMT practice. Two recent papers suggest that language models may soon be back in supervised MT. Devlin et al. (2018) improve several natural language processing tasks in English by using pre-trained bidirectional language models. Conneau and Lample (2019) extend this approach to multilingual language models and improve both unsupervised and supervised MT. In this post, we will take a look at the approach of Devlin et al.
BERT (Bidirectional Encoder Representations from Transformers)
Standard language models read the text input sequentially (left-to-right or right-to-left, or both combined) and predict a word given the previous words in the sequence. For example, in a left-to-right model, the probability of the word following “I drive my” depends on the sequence already read (“I drive my”). This makes them directional models. As a consequence, they can take into account either the left or the right context of a word at any one time, but not both together. Devlin et al. build a language model from a transformer encoder, an attention-based neural network which reads the entire input sequence at once. Thus it is not directional (it is actually called bidirectional), which allows the model to learn the context of a word based on both its left and right surroundings.
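The difference between directional and bidirectional context can be illustrated with a small sketch. It only shows which words a model is allowed to condition on when predicting a given position; it is not a language model. The function name `visible_context` and the mode labels are assumptions made for this example.

```python
def visible_context(tokens, position, mode):
    """Return the tokens a model may condition on when predicting tokens[position]."""
    left = tokens[:position]
    right = tokens[position + 1:]
    if mode == "left-to-right":
        return left          # only the words already read
    if mode == "right-to-left":
        return right         # only the words that follow
    if mode == "bidirectional":
        return left + right  # both sides, but never the target word itself
    raise ValueError(f"unknown mode: {mode}")


tokens = "I drive my bike on the cycle track".split()
target = tokens.index("bike")

for mode in ("left-to-right", "right-to-left", "bidirectional"):
    print(mode, "->", visible_context(tokens, target, mode))
```

In the left-to-right case the model sees only “I drive my”, whereas the bidirectional case also exposes “on the cycle track”.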
However, standard language models built from transformers must still be trained in a directional fashion. This is because in a transformer network all nodes are connected to one another. In bidirectional training, this would allow each word to indirectly “see itself” across different layers. This means that the probability of the word X in “I drive my X on the cycle track” would depend on the probabilities of network nodes involving that word.
Over-simplifying a bit, to calculate the probability of a word, we would need to know the probability of that word. To solve this problem, Devlin et al. mask a small percentage of the words (about 15%) and predict only those masked words. In this way they manage to train a bidirectional language model, which is the main innovation of their paper. Note that in our example, taking only the left context into account, “car” and “bike” may be equally likely to follow “I drive my”. The right context “on the cycle track” allows the model to disambiguate.
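The masking step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `mask_tokens`, the `[MASK]` placeholder string, whitespace tokenization and the `seed` parameter are all choices made for this example, and every selected word is simply replaced by the placeholder.

```python
import random

MASK = "[MASK]"  # placeholder string chosen for this sketch


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Hide roughly `mask_rate` of the tokens and return (masked_tokens, targets).

    `targets` maps each masked position to the original word, i.e. the only
    words the model would be asked to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so that short sentences still yield a target.
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


sentence = "I drive my bike on the cycle track".split()
masked, targets = mask_tokens(sentence, seed=0)
print(" ".join(masked))
print(targets)
```

Because the target word is removed from the input, the model can use both the left and the right context without the word “seeing itself”.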
Another characteristic of the proposed model is next sentence prediction. The authors input pairs of sentences to the model instead of single sentences, which is useful in tasks such as question-answering or textual entailment, in which one segment logically depends on another. The resulting language model is called BERT (Bidirectional Encoder Representations from Transformers).
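One way to build sentence pairs for next sentence prediction is sketched below. The details are assumptions for illustration and are not described in this post: the function name `make_sentence_pairs`, the roughly even split between true and random second sentences, and the tuple layout are all hypothetical choices.

```python
import random


def make_sentence_pairs(sentences, seed=None):
    """Build (first, second, is_next) examples from an ordered list of sentences.

    About half of the pairs keep the true following sentence (is_next=True);
    the others pair the first sentence with a randomly chosen different one.
    """
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        # Candidates for a "wrong" second sentence: anything except the
        # first sentence itself and its true successor.
        candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
        if candidates and rng.random() < 0.5:
            second, is_next = rng.choice(candidates), False
        else:
            second, is_next = sentences[i + 1], True
        examples.append((first, second, is_next))
    return examples


document = [
    "I drive my bike on the cycle track.",
    "The track runs along the river.",
    "It is closed in winter.",
    "Language models predict words.",
]
for first, second, is_next in make_sentence_pairs(document, seed=1):
    print(is_next, "|", first, "|", second)
```

The label `is_next` is what the model would learn to predict from the pair, alongside the masked words.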
'Fine Tuning' from trained representations
In the field of computer vision, it is common practice to train neural networks on a known task and fine-tune these pre-trained networks on a different task. This approach is called transfer learning. Recently, several researchers have shown the value of this technique in natural language processing tasks.
In the present paper, the authors use the pre-trained BERT encoders and fine-tune them on several tasks. In sentence classification tasks, the outputs of the BERT network are aggregated into a final hidden state, to which a classification layer appropriate to the task is added. In question-answering tasks, the span of the answer in the segment is calculated from the final hidden states corresponding to each word: the probability of each word being the start or the end of the span is computed. With a large BERT network pre-trained on a large amount of data, Devlin et al. significantly improve the state of the art in many natural language tasks, such as question-answering, named-entity recognition, textual entailment, language inference, sentiment analysis, semantic equivalence and linguistic acceptability.
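The span selection used in question-answering can be sketched as follows. The per-word start and end scores below are made-up numbers standing in for the output of a fine-tuned network, and the function names `softmax` and `best_span` are assumptions for this example. The sketch turns the scores into probabilities and picks the span whose start and end probabilities have the highest product, with the start not after the end.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_scores, end_scores):
    """Return ((start, end), probability) for the most likely span with start <= end."""
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    best, best_prob = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, len(p_end)):
            prob = ps * p_end[j]
            if prob > best_prob:
                best, best_prob = (i, j), prob
    return best, best_prob


tokens = "the bike is parked on the cycle track".split()
# Hypothetical scores, one per word, for "is this the start / end of the answer?"
start_scores = [0.1, 0.2, 0.1, 0.3, 0.5, 2.0, 1.0, 0.2]
end_scores = [0.1, 0.1, 0.1, 0.2, 0.1, 0.3, 0.9, 2.5]

(start, end), prob = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]), round(prob, 3))
```

The exhaustive search over all start and end pairs is quadratic in the segment length, which is acceptable for a sketch but would normally be bounded by a maximum answer length.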