BLEU score
Description
Developed by researchers at IBM, BLEU (Bilingual Evaluation Understudy) is one of the most widely used automatic evaluation metrics in Natural Language Processing (NLP). It measures how similar machine-generated text is to one or more human reference translations. Specifically, it computes the overlap of words and short phrases (n-grams, typically up to four words long) between the candidate and the reference, combined with a brevity penalty that discourages overly short output.
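To make the mechanism concrete, here is a minimal sketch of sentence-level BLEU in Python: clipped n-gram precisions up to 4-grams, a brevity penalty, and a geometric mean. This is an illustrative simplification (real implementations such as sacreBLEU add smoothing, multi-reference support, and standardized tokenization); the function and variable names are our own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Illustrative sentence-level BLEU against a single reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped counts: each candidate n-gram is credited at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no higher-order overlap at all
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    # Geometric mean of the n-gram precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on the mat".split()
print(round(bleu(hyp, ref), 3))  # identical sentences score 1.0
```

Swapping a single word for a synonym (e.g. "feline" for "cat") breaks several 2-, 3-, and 4-gram matches at once and drops the score sharply, even though the meaning is preserved; this is exactly the exact-match limitation discussed below.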
While BLEU is valued for its speed, low cost, and language independence, it has known limitations. It rewards exact word matches rather than meaning or fluency: a translation that uses synonyms or a different but perfectly valid sentence structure can receive a low BLEU score simply because it does not match the reference exactly. For this reason, BLEU is rarely used as the sole indicator of quality. It is typically used for benchmarking multiple MT engines against each other or for tracking a model's progress during training. In modern workflows, it is often complemented by edit-distance metrics such as TER (Translation Edit Rate) and by newer, semantic-aware neural metrics such as COMET to provide a more holistic view of performance.