Issue #53 - Using Fuzzy Matches in Neural MT
Introduction
In Computer-Assisted Translation (CAT) workflows, the Machine Translation (MT) system is often used as a backoff mechanism when the Translation Memory (TM) fails to retrieve high fuzzy matches above a certain threshold. This is not always the most optimal TM-MT combination strategy. In this post, we'll discuss an advanced integration method proposed by Bulté and Tezcan (2019).
TM and Fuzzy match retrieval
In the retrieval process, each source sentence is compared to all other source sentences. The source sentences that match a given sentence with a similarity score higher than the specified threshold are added in the fuzzy-match-list with corresponding target.
In the paper, Bulté and Tezcan (2019) used token-based edit distance as the primary metric for similarity scores. Since extracting the fuzzy matches from a large TM using edit distance is computationally expensive, they improve it by using the SetSimilaritySearch algorithm. Using SetSimilaritySearch, for each source sentence they first extract a set of n-best candidates, then the multithreaded edit distance match is applied on this n-best list only. With the improved method, they were able to reduce the fuzzy matching time on the training set to 0.25% of the time it takes to extract matches using only edit distance.
Source augmentation
For each source sentence s1 for which at least one sufficiently high-scoring match is found in the TM an augmented source xi is generated using one of the following formats, while preserving the target t i-Integration
Bulté and Tezcan (2019) experimented with two methods for integrating the augmented training data in the NMT process:
- Two separate NMT systems, a backoff NMT system with original training data, and a primary NMT system using only augmented data.
- One unified system that uses the union of original data and augmented training data.
In the first method the backoff system is used when there is no fuzzy match for a given source as the primary system is only capable of translating the augmented source. Whereas in the unified method both types of test segments are translated using the single NMT system.
How effective it is?
To evaluate the effectiveness, they experimented with two language pairs: English into Dutch (EN-NL) and English into Hungarian (EN-HU). Using the backoff method, the format 1 n-best was found to be optimal, whereas format 3 works best in a unified approach.
Using the source augmentation, for EN-NL they were able to gain +3.19 BLEU over the best baseline including Google Translate. The improvements for EN-HU are even larger of +7.46 BLEU points.