Edit distance: not a miracle cure

Carla Schelfhout 02 Feb 2022 4 mins
Machine translation has very much become mainstream and is used for a wide variety of purposes, content types and languages. This has led to a growing demand for fast validation of MT quality. This is particularly relevant for the localization industry, where the main use case is post-editing. There is a need to predict the suitability of MT for post-editing purposes in a way that is fast and automated, but still reliable.
 
Against a background where the value of the classic BLEU metric is increasingly being questioned, the spotlight turns to the group of edit distance metrics and their usefulness for assessing or predicting suitability for post-editing. When it comes specifically to reimbursing post-editors, edit distance metrics are seen as a reliable indicator of post-editing effort. This is not necessarily a new idea; however, there is not yet an industry-recognized process for using edit distance for this purpose. Why is that? The apparent ease of this solution in fact hides a number of complexities. Let’s take a closer look.

What is edit distance?

First of all, a few words on edit distance. There are various metrics that measure edit distance, with Levenshtein distance and TER(p) among the best-known. In essence, all edit distance metrics work on the same principle: they measure the minimal number of edits necessary to change one string (in this case: the MT output) into another string (in this case: the final translation). An edit can be an insertion, a deletion, or a substitution. 
 
For example, the Levenshtein distance between "table" and "task" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
 
1. table → tabl (deletion of "e")
2. tabl → tabk (substitution of "k" for "l")
3. tabk → task (substitution of "s" for "b")
 
For many purposes, it makes a difference whether there are 3 edits in a string of 3 words or in a string of 13 words. Hence the number of edits is often divided by the string length, which yields a number between 0 and 1 (usually displayed as a number between 0 and 100). This is called ‘normalization’ and it makes edit distance values more comparable across strings of different lengths.
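To make this concrete, here is a minimal Python sketch of character-level Levenshtein distance with a normalized variant. The function names are illustrative rather than taken from any particular library, and normalizing by the length of the longer string is just one of several conventions in use.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and substitutions to turn a into b."""
    # Dynamic programming over prefixes: prev[j] holds the distance between the
    # previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion of ca
                            curr[j - 1] + 1,      # insertion of cb
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]


def normalized_edit_distance(mt: str, pe: str) -> float:
    """Edit distance divided by the length of the longer string, scaled to 0-100."""
    longest = max(len(mt), len(pe))
    return 0.0 if longest == 0 else 100 * levenshtein(mt, pe) / longest


print(levenshtein("table", "task"))               # 3
print(normalized_edit_distance("table", "task"))  # 60.0
```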

Can edit distance help to select the best MT quality?

Let’s take a look at the scenario of selecting the best MT model to use in production. In order to do so, users need a way to tell which model out of a set of models performs best. Let’s assume that the post-editors have performed all and only the necessary changes to the MT output – in itself already a big assumption. On that basis, the engine that required the fewest edits to the same set of source sentences would seem to be the one best suited to the post-editing use case.
 
This is not necessarily true. The problem here is that tests must be quickly repeatable and automated. This only works if you have a frozen set of gold standard translations, to which you can compare new MT output automatically. However, most source sentences can have more than one valid translation. A good post-editor will opt for the one that is closest to the MT output to minimize the editing effort. Freezing one set of final translations as gold standard penalizes MT output which would require similar or less PE effort but would result in a different final translation.
 
An artificial example can illustrate the principle:
 
Gold standard translation: The cat bit the dog.
  • MT1: The cat hit the dog. Edit distance to gold standard: 1.
  • PE1: The cat bit the dog. Edit distance from MT1 to PE1: 1.
  • MT2: The dog was bitten by the cat. Edit distance to gold standard: 15.
  • PE2: The dog was bitten by the cat. Edit distance from MT2 to PE2: 0.
 
MT2 is closer to the original meaning than MT1, and would require less post-editing effort in a live scenario, but is penalized because it chooses a different construction than the gold standard.
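The same contrast can be reproduced with the levenshtein() sketch from earlier; the sentences below are the artificial example above, and while the exact character counts depend on implementation details, the imbalance is the point:

```python
# Character-level comparison of the artificial example, reusing the
# levenshtein() helper sketched earlier.
gold = "The cat bit the dog."
mt1  = "The cat hit the dog."
mt2  = "The dog was bitten by the cat."
pe2  = "The dog was bitten by the cat."   # the post-editor left MT2 untouched

print(levenshtein(mt1, gold))  # 1  -> MT1 looks nearly perfect against the frozen reference
print(levenshtein(mt2, gold))  # 15 -> MT2 is heavily penalized by the frozen reference
print(levenshtein(mt2, pe2))   # 0  -> yet MT2 required no post-editing at all
```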
 
Edit distance provides the most information when new MT output is post-edited afresh every time. While this is technically possible, it is so labor-intensive and time-consuming that it is not a practical solution for the quick selection of a model. There are workarounds, of course, but the right balance between speed and accuracy needs to be considered for every use case.

Can edit distance be used to measure post-editing effort?

Another way to utilize edit distance is as an indication of post-editing effort. The reasoning is that fewer edits indicate less effort on the post-editor’s part – and hence the reward should be lower, as the post-editor is only paid for the work that was actually carried out. This is not entirely true though, as becomes clear when the tasks of translation and post-editing are broken down into their component steps.
When the MT output is closer to a good translation, fewer keystrokes are necessary to arrive at the final translation than when the final translation has to be typed from scratch. That means that the typing step can be completed faster. However, all other steps in the overall task still need doing, and one step – evaluating the MT output – is in fact added compared to conventional translation. When post-editing needs only 7 edits while translating requires 14 typing actions, this does not mean that post-editing overall is twice as fast. It only means that the typing task entails 50% fewer keystrokes. The post-editing job overall would be less than 50% faster – how much less cannot be determined from the edit distance alone.
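A back-of-the-envelope calculation makes the gap visible. The 40% typing share below is a purely illustrative assumption, not a measured figure:

```python
# Illustrative assumption: typing accounts for 40% of total translation effort;
# the remaining 60% is reading, research, term validation and review.
typing_share = 0.40
other_share = 1 - typing_share

keystroke_reduction = 0.50  # post-editing needs 50% fewer keystrokes (7 vs 14)

# Overall effort relative to translating from scratch (ignoring the extra
# step of evaluating the MT output, which would make the saving even smaller):
relative_effort = other_share + typing_share * (1 - keystroke_reduction)
print(f"Overall effort: {relative_effort:.0%} of a from-scratch translation")  # 80%
```

Under this assumption, halving the keystrokes saves only around 20% of the overall job, not 50%.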
 
Another example may clarify that. Let’s assume that work is being done for a client in the life science industry with a customized MT system. Typical source may look like this: “CELLine™ 1000, a membrane-based, disposable cell cultivation system, guarantees high cell densities and is easy to use for recombinant protein expression and high yield monoclonal antibody (MAb) production.” Given the need for exactness in this domain, the post-editor will need to validate all the technical terms – such as “cell cultivation system”, “recombinant protein expression” and “monoclonal antibody (MAb)” – in the TM and/or the term base. If the MT system provides the correct translation, no changes are necessary and edit distance will be 0. Does that mean that no effort was spent on this segment?
 
The proposal to reward post-editing by edit distance is based on a misconception: the idea that post-editing (and translation in general) is nothing more than typing. However, the key skill is knowing what to type, and as a text becomes more specialized, the post-editor’s time is increasingly spent on validating the meaning and less on the actual keystrokes. Post-editing effort is determined by a combination of the domain, the post-editor’s experience, the MT output quality and the client’s quality expectations for the final delivery. Edit distance can only measure the changed characters; it is not a one-to-one indicator of total effort spent. While there could be ways to work around this complexity when reimbursing post-editing, it is not a straightforward approach.
 
Of course this does not mean that edit distances should have no place in the overall post-editing process. When edit distance metrics are tracked across the lifecycle of projects, for example, any change to a familiar pattern can flag up issues. Edit distance metrics that are very high or very low will also raise flags. Monitoring the health of a project in this way is quick and automated, and provides more insight than random spot-checks of individual post-editing jobs.
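As an illustration of this kind of monitoring, the sketch below flags segments whose normalized edit distance falls outside a familiar range. The thresholds are hypothetical and would in practice be tuned per project, language pair and content type; it reuses the normalized_edit_distance() helper from the earlier sketch.

```python
# Hypothetical thresholds for flagging; tune per project in practice.
LOW_FLAG = 2.0    # suspiciously little editing: possible under-editing
HIGH_FLAG = 60.0  # suspiciously heavy editing: possible MT, source or scope issues

def flag_segments(pairs):
    """pairs: iterable of (mt_output, post_edited_translation) tuples."""
    for idx, (mt, pe) in enumerate(pairs, start=1):
        score = normalized_edit_distance(mt, pe)
        if score <= LOW_FLAG or score >= HIGH_FLAG:
            # The flag only raises a question; a reviewer still has to find out
            # whether the cause is under-editing, poor MT, unusual source, etc.
            print(f"Segment {idx}: normalized edit distance {score:.1f} - please review")
```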
 
In this scenario, the cause of any flag – such as under-editing or an unusual source text – will still need investigation. Here, the edit distance metric does not provide the answer but raises the question.
 
Edit distance metrics are a great way to measure the delta between MT output and final translation, but in order to draw any conclusion from these numbers, their context needs to be fully understood. Edit distances are not the be-all and end-all on their own, but provide most value when embedded in a carefully designed process related to a specific purpose. Generating edit distance metrics is fairly straightforward but designing and running meaningful processes on the basis of these metrics requires careful thought and application.
 

Carla Schelfhout

Director of Linguistic AI Solutions
Based in the Netherlands, Carla received her PhD in computational linguistics at Nijmegen University. Since joining RWS in 2007, she has focused primarily on machine translation and how to create, customize and evaluate best-of-breed machine translation solutions for a variety of clients and use cases.