As we have mentioned in previous posts in this series, getting terminology right is a research challenge yet to be fully solved in Machine Translation (MT), and a common request made by clients to MT providers. We have revisited different proposed solutions, such as using annotations (#122) or ways of integrating dictionaries (#123). In today’s post we look at the work by Jon et al. (2021), who propose a learned constraining method using target lemma annotations, which allows the MT system to learn the right lexical correspondence and produce the right inflection for each term. This is particularly relevant for morphologically rich languages.
Lexically constrained MT consists of forcing the appearance of a certain term on the target side, given the presence of its corresponding term in the source text. This approach has proven to work relatively well, with one caveat: it does not handle word inflection correctly. Hence, while the translation of the term may be “correct”, it may not agree with the surrounding words in the sentence. This issue is exacerbated when dealing with morphologically rich languages, as the number of possible inflections of a single term is greater.
Jon et al. (2021) focus on Czech, a morphologically rich language, as a target language and investigate mechanisms to use lemma annotations for specific terminology, which would allow the MT system to infer the correct word inflection in each case. Their approach seems to reduce the number of inflection errors while overall quality of the MT does not get negatively impacted by the lexical constraints used.
Traditionally, there are three main ways to integrate terminology in MT:
- Post-processing the output to enforce the presence of specific terms;
- Constrained decoding (manipulating the decoding process to add constraints to the final translation); and
- Learned constraining (adding the constraints to the input sentence prior to training, so that after training the model gets biased towards utilizing them).
Jon et al. (2021) propose a learned constraining method whereby they concatenate the desired target lemmas to the source sentences of the training corpus. This approach, which simplifies data preparation prior to MT training, yielded promising results in their experiments, although the authors acknowledge a slight performance decrease.
The method is similar to the one proposed, almost concurrently, by Bergmanis and Pinnis (2021), with one main difference: how the constraints are integrated into the model. Bergmanis and Pinnis (2021) directly annotate the source tokens with lemma translations by means of factors, while Jon et al. concatenate the lemmas to the source sentences in the training data.
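To make the idea of concatenating target lemmas to the source more concrete, here is a minimal sketch of how such an input could be built. The marker tokens `<sep>` and `<c>`, as well as the function name `add_constraints`, are assumptions made for illustration and not necessarily the exact format used by the authors.

```python
from typing import List

SEP = "<sep>"  # hypothetical marker between the source sentence and the constraints
CON = "<c>"    # hypothetical marker between individual constraints


def add_constraints(source: str, target_lemmas: List[str]) -> str:
    """Append target-side lemma constraints to a tokenized source sentence."""
    if not target_lemmas:
        # Sentences without constraints are passed through unchanged.
        return source
    return f"{source} {SEP} " + f" {CON} ".join(target_lemmas)


example = add_constraints(
    "The committee approved the directive .",
    ["výbor", "směrnice"],  # desired Czech lemmas for "committee" and "directive"
)
print(example)
# The committee approved the directive . <sep> výbor <c> směrnice
```

The same function would be applied at training and at inference time, so that the model sees constraints in a consistent position and learns to inflect the supplied lemmas as the target context requires.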
The authors carry out various experiments and use two different types of test sets. The first is an oracle test set, where the constraints are entries from an English-Czech dictionary whose source and target sides are both present in the sentence pair. This test set, extracted from newstest-2020, is used to measure the model's ability to integrate the constraints. The second test set aims at mimicking a more realistic use case, and was extracted from Europarl. In this case, the authors use official terminology for EU-related expressions.
They also trained two different sets of constrained models. The first set of models constrains the original surface forms of the translations, whereas the second set constrains the lemmatized form of the terms instead. In doing so, the authors attempt to measure the ability of the model to generate the correct surface forms, given a constraint using a lemma instead of a surface form.
The following models were trained in each case:
- A baseline engine with no constraints.
- Random sampling models: Models where random subsequences of target tokens were sampled and subsequently constrained. Different models were trained using either surface or lemmatized forms of the sampled tokens.
- Dictionary models: Models where the entries found in the dictionary are constrained. Again, different models were trained constraining either the surface forms or the lemmas.
- Dictionary, skip half models: Models where constraints were only applied to half of the training data.
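The three ways of choosing training constraints listed above can be sketched roughly as follows. This is an illustration under our own assumptions: the function names, the `max_constraints` and `max_len` limits, and the toy dictionary format are all hypothetical, and the lemma-based variants would additionally run the selected tokens through a lemmatizer, which is not shown here.

```python
import random
from typing import Dict, List


def sample_random_constraints(target_tokens: List[str], rng: random.Random,
                              max_constraints: int = 2, max_len: int = 2) -> List[str]:
    """Random sampling variant: pick random contiguous subsequences of the target."""
    constraints: List[str] = []
    if not target_tokens:
        return constraints
    for _ in range(rng.randint(0, max_constraints)):
        length = rng.randint(1, min(max_len, len(target_tokens)))
        start = rng.randint(0, len(target_tokens) - length)
        constraints.append(" ".join(target_tokens[start:start + length]))
    return constraints


def dictionary_constraints(source_tokens: List[str], target_tokens: List[str],
                           dictionary: Dict[str, List[str]]) -> List[str]:
    """Dictionary variant: keep translations whose source and target sides both occur."""
    target_set = set(target_tokens)
    found: List[str] = []
    for token in source_tokens:
        for translation in dictionary.get(token, []):
            if translation in target_set:
                found.append(translation)
    return found


def maybe_skip(constraints: List[str], rng: random.Random,
               skip_prob: float = 0.5) -> List[str]:
    """Skip-half variant: drop all constraints for roughly half of the sentence pairs."""
    return [] if rng.random() < skip_prob else constraints


rng = random.Random(0)
toy_dictionary = {"committee": ["výbor", "komise"]}  # hypothetical dictionary entry
src = "the committee approved the directive .".split()
tgt = "výbor schválil směrnici .".split()
print(dictionary_constraints(src, tgt, toy_dictionary))  # ['výbor']
print(maybe_skip(dictionary_constraints(src, tgt, toy_dictionary), rng))
```

Leaving some training pairs without constraints is what lets a single model translate both constrained and unconstrained inputs.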
The authors compute the automatic evaluation metric BLEU, but also the coverage of the terms being targeted in each case. As they make a distinction between surface forms and lemmas, they also compute the coverage of the two.
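The difference between surface-form and lemma coverage can be illustrated with a small sketch. The lemma table below is a toy stand-in for a real Czech lemmatizer, and the function names are our own; this is not the authors' evaluation script.

```python
from typing import List

# Toy lemma table standing in for a real Czech lemmatizer (an assumption).
TOY_LEMMAS = {
    "směrnici": "směrnice",
    "směrnice": "směrnice",
    "výboru": "výbor",
    "výbor": "výbor",
}


def lemmatize(tokens: List[str]) -> List[str]:
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def contains(tokens: List[str], phrase: List[str]) -> bool:
    """Check whether `phrase` occurs as a contiguous subsequence of `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def coverage(outputs: List[str], constraints: List[List[str]], use_lemmas: bool) -> float:
    """Fraction of constraints found in the corresponding MT output."""
    covered = total = 0
    for output, sentence_constraints in zip(outputs, constraints):
        out_tokens = output.lower().split()
        if use_lemmas:
            out_tokens = lemmatize(out_tokens)
        for constraint in sentence_constraints:
            con_tokens = constraint.lower().split()
            if use_lemmas:
                con_tokens = lemmatize(con_tokens)
            total += 1
            covered += int(contains(out_tokens, con_tokens))
    return covered / total if total else 0.0


outputs = ["výbor schválil směrnici ."]
constraints = [["výbor", "směrnice"]]
print(coverage(outputs, constraints, use_lemmas=False))  # 0.5
print(coverage(outputs, constraints, use_lemmas=True))   # 1.0
```

In this toy case the output uses the accusative form "směrnici", so the constraint "směrnice" counts as missed at the surface level but as covered once both sides are lemmatized.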
On the oracle test set, they observe that the best model is the one trained with dictionary-based constraints, which are generated in the same way as the constraints in that test set. BLEU and coverage also drop substantially when no constraints are supplied. In this test set, the coverage of surface forms is up to 93% for models trained with surface forms, and drops to 61-68% when using lemmatized constraints. However, this was to be expected, as the former models were trained to reproduce the surface forms given in the constraints.
When looking at the terminology integration, it seems that generating the correct form of a constraint is challenging for the model if the required surface form is different from the one provided in the input. To properly assess this, they split the Europarl test set into two categories: sentences whose reference contains the term in the same surface form as the constraint, and sentences whose reference contains it in a different surface form. While 44% of the surface forms were covered, if we look at the lemma coverage, this percentage goes up to 96.6%. This difference seems to be more pronounced when the surface form of the term is different. A manual human evaluation, however, showed that in 92% “of the cases marked as not covered when using the lemmatized model, the form of the constraint is different from the reference, but correct given the context, as the model translates the sentences differently (but correctly).”
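The split of the test set described above could be approximated with a simple check on the reference translation. This is a sketch under our own assumptions (whitespace-tokenized text, exact string matching, an illustrative function name), not the authors' procedure.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (reference translation, constraint as supplied to the model)


def split_by_reference_form(examples: List[Example]) -> Tuple[List[Example], List[Example]]:
    """Separate examples by whether the reference contains the constraint verbatim."""
    same: List[Example] = []
    different: List[Example] = []
    for reference, constraint in examples:
        # Padding with spaces ensures that only whole tokens are matched.
        if f" {constraint} " in f" {reference} ":
            same.append((reference, constraint))
        else:
            different.append((reference, constraint))
    return same, different


examples = [
    ("Tato směrnice platí .", "směrnice"),     # reference uses the constraint's form
    ("Výbor schválil směrnici .", "směrnice"),  # reference uses an inflected form
]
same, different = split_by_reference_form(examples)
print(len(same), len(different))  # 1 1
```

Reporting coverage separately on the two subsets shows how much of the gap between surface and lemma coverage comes from sentences that need an inflected form.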
Finally, to compare their proposed method against another state-of-the-art method, they evaluated lexically constrained decoding on the Europarl test sets. Their results show that constrained decoding produces the constraints in the output as expected but, precisely because of that, it fails to produce the correct surface forms when they differ from the constraint used.
Jon et al. (2021) propose a relatively simple way to use learned constraints at training that seems to be promising, particularly for morphologically rich languages. Their approach is to a certain extent similar to other approaches making use of annotations to help MT models learn translations, and hence differs from constrained decoding that forces the appearance of a certain term in the output. This has advantages and disadvantages: on the one hand, the model learns to translate terms correctly without having to constrain all individual word forms of a term. On the other hand, it also fails at times, since we are effectively trying to “teach” the model how to translate a certain term, but the model is the one that ultimately “decides” which translation is used at decoding time.
Whether this is good enough for real production settings would still need to be assessed. Would a client accept terms being translated correctly most of the time, with the occasional mistranslation, as would be the case if such models were used? Another influencing factor would, of course, be the use case of the MT. This is indeed an interesting approach, but when it comes to translating terminology, the main question is: how good is good enough?