Getting "Real-World" Ready

In the previous 44 issues of this series, we have reviewed many cutting-edge approaches in Neural MT research and development. But we are not just paying lip service: we practice what we preach and look to incorporate as many of these approaches as possible into our own software. That said, many of them are prototypes, or at an early research stage. As a commercial provider of machine translation, we need to rigorously test our implementations to make sure they stand up to the challenges and variables of production deployment. Let's look at some of the steps we take, and why.
Data Cleaning and Preparation

We've all heard the old adage "garbage in, garbage out" when it comes to MT. This topic is so important for Neural MT that we covered it in our second-ever post! In our MT Summit paper, we describe the steps we take not only to clean and filter noisy data, but also to apply additional processing that ensures certain entities and formatting are retained correctly in the translation output.
On the cleaning side, this includes the normalisation of character encodings, removal of duplicate or overly long entries, and, perhaps less obviously, filtering of segments that are actually in the wrong language, which can be disturbingly common. We also carry out steps to process numbers and punctuation correctly, depending on the language pair. Finally, we implement some proprietary steps to ensure that certain terms are translated in a specific manner and/or not translated at all (so-called 'do not translates').
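The cleaning steps above can be sketched roughly as follows. This is a minimal, illustrative pipeline, not our production code: the `is_expected_lang` hook is a hypothetical stand-in for a real language identifier (e.g. a fastText model), and the length and normalisation choices are simplifying assumptions.

```python
import unicodedata

def clean_corpus(pairs, max_len=200, is_expected_lang=None):
    """Filter a list of (source, target) segment pairs.

    is_expected_lang: optional callable taking the source segment and
    returning True if it is in the expected language (hypothetical hook;
    in practice this would be a trained language identifier).
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Normalise character encodings (NFC) and trim whitespace
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        # Drop empty or overly long entries
        if not src or not tgt:
            continue
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue
        # Drop exact duplicates
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        # Drop segments that appear to be in the wrong language
        if is_expected_lang is not None and not is_expected_lang(src):
            continue
        kept.append((src, tgt))
    return kept
```

In practice each of these filters would be more sophisticated (e.g. length-ratio checks between source and target rather than absolute lengths), but the overall shape of the pipeline is the same.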
While it is difficult to fully quantify the positive impact of these steps - especially when it comes to preserving terminology in certain applications - we broadly see an improvement of 3 BLEU points across generic data sets, as well as a significant reduction in common errors like over- and under-generation.
Tokenisation and Subword Encoding
Statistical MT was quite robust to non-standard inputs that may not have appeared in the training data, but Neural MT is much more unpredictable. Entities such as email addresses, URLs, tags, and lists, which are clearly important in the documents where they are found, need to be handled explicitly to ensure they are processed and translated correctly. This involves applying tokenisation that is specific not only to the language, but also to the preservation of these entities. To improve vocabulary coverage, particularly for Asian languages, we also employed a version of subword encoding, a topic we covered in Issues #3 and #12.
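One common way to preserve such entities, sketched below under assumptions of our own (the placeholder token scheme and the regex patterns are illustrative, not our actual implementation), is to mask them with tokens the model learns to copy through, then restore them after translation:

```python
import re

# Illustrative patterns; production systems would cover tags, lists, etc.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]

def mask_entities(text):
    """Replace entities with numbered placeholder tokens.

    Returns the masked text and a mapping from token to original entity,
    so the entities can be restored in the translated output.
    """
    mapping = {}
    counter = 0
    for label, pattern in PATTERNS:
        def repl(match):
            nonlocal counter
            token = f"__{label}{counter}__"
            mapping[token] = match.group(0)
            counter += 1
            return token
        text = pattern.sub(repl, text)
    return text, mapping

def unmask_entities(translated_text, mapping):
    """Restore the original entities in the (translated) text."""
    for token, original in mapping.items():
        translated_text = translated_text.replace(token, original)
    return translated_text
```

Because the placeholders pass through translation untouched, the entities survive verbatim regardless of how the model would otherwise tokenise or translate them.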
We found that, from a practical perspective, automatic quality metrics were not sensitive enough to pick up these subtle differences in the output. However, extensive manual evaluation showed that such entities were preserved in 100% of cases, along with a significant reduction in out-of-vocabulary terms, varying somewhat depending on the nature of the test sets.