BLEU, an acronym for "Bilingual Evaluation Understudy", is a metric used to evaluate how closely machine-generated translations match translations produced by humans. Originally developed at IBM, BLEU remains one of the most widely used automatic measures of machine translation quality.

In BLEU, a machine-produced translation is compared with one or more reference translations by matching their n-grams, or contiguous sequences of words. Based on this comparison, the machine-made translation is scored within a range of 0 to 1, with 1 implying a perfect match with a reference translation. The more closely the machine translation's n-grams align with those of the reference translations, the higher the score.

The BLEU metric can be unreliable when comparing translations between languages with very different grammar or word structure. Nonetheless, its convenience and simplicity ensure that it remains extensively used as a benchmark for assessing machine translation.

Steps to compute BLEU score:

Identify the n-gram precision - Count the n-grams in the machine-generated translation that also appear in the reference translations, clipping each n-gram's count at the maximum number of times it occurs in any single reference. Divide this by the total number of n-grams in the machine-generated translation.
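This step can be sketched as follows; the function name `modified_precision` and the toy sentences are illustrative. The clipping prevents a degenerate candidate from scoring well by repeating a common word:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the most generous single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        max_ref |= ngrams(ref, n)   # elementwise max across references
    clipped = cand_counts & max_ref  # elementwise min = clipping
    return sum(clipped.values()) / sum(cand_counts.values())

cand = "the the the the".split()
refs = ["the cat is on the mat".split()]
print(modified_precision(cand, refs, 1))  # 0.5: "the" is clipped to 2 of 4
```

Without clipping, the degenerate candidate above would score a perfect unigram precision of 1.0.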

Implement the brevity penalty - Translations shorter than the reference translations are penalized. If the machine-generated translation (length c) is longer than the closest reference (length r), the penalty is 1; otherwise it is exp(1 - r/c), which shrinks toward 0 as the translation gets shorter.
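A minimal sketch of this step (the function name `brevity_penalty` and the tie-breaking choice of the shorter reference on equal distance are assumptions, matching common practice):

```python
import math

def brevity_penalty(cand_len, ref_lens):
    """BLEU brevity penalty: 1 for candidates longer than the closest
    reference, exp(1 - r/c) otherwise."""
    # Effective reference length r: the reference closest in length to
    # the candidate (shorter one preferred on ties).
    r = min(ref_lens, key=lambda rl: (abs(rl - cand_len), rl))
    if cand_len > r:
        return 1.0
    return math.exp(1 - r / cand_len)

print(brevity_penalty(6, [6]))  # 1.0 -- same length, no penalty
print(brevity_penalty(3, [6]))  # exp(-1), roughly 0.368
```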

Combine n-gram precision measures - The geometric mean of the n-gram precisions yields a single score reflecting how well the translation matches the reference translations across n-gram sizes.

Determine the final BLEU score - The final BLEU score is obtained by multiplying the combined n-gram precision by the brevity penalty.

The formula for BLEU score computation:

BLEU = brevity_penalty * exp(sum(w_n * log(p_n)))

Explanation of variables:

brevity_penalty: The penalty for overly short translations, as described above.

w_n: The weight assigned to each n-gram precision score, typically 1/N, where N is the number of n-gram sizes used (e.g. N = 4 for BLEU-4).

p_n: The clipped precision score for n-grams of size n.
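Putting the pieces together, a self-contained sketch of the full formula with uniform weights (the function name `bleu` and the example sentences are illustrative; real implementations also apply smoothing when a precision is zero):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """BLEU with uniform weights w_n = 1/max_n, following the formula above."""
    # Clipped n-gram precisions p_1 .. p_max_n.
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            max_ref |= ngrams(ref, n)
        clipped = cand_counts & max_ref
        total = sum(cand_counts.values())
        precisions.append(sum(clipped.values()) / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; any zero precision zeroes the score
    # Brevity penalty against the closest reference length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda rl: (abs(rl - c), rl))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

cand = "the cat sat on the mat".split()
refs = ["the cat sat on the mat".split()]
print(bleu(cand, refs))  # 1.0 -- exact match
```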

BLEU scores come in two varieties: Cumulative and Individual.

An individual BLEU-n score uses the precision of a single n-gram size in isolation; individual BLEU-2, for example, considers only bigrams. A cumulative BLEU-n score combines the precisions of all n-gram sizes from 1 up to n by taking their geometric mean. Cumulative scores are how BLEU is usually reported, with cumulative BLEU-4 being the most common.
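The difference can be illustrated with a small sketch (the `precision` helper and example sentences are assumptions for illustration; brevity penalty is omitted since candidate and reference have equal length):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference."""
    clipped = ngrams(candidate, n) & ngrams(reference, n)
    total = max(len(candidate) - n + 1, 1)
    return sum(clipped.values()) / total

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()

p1, p2 = precision(cand, ref, 1), precision(cand, ref, 2)
individual_bleu2 = p2  # bigram precision alone
cumulative_bleu2 = math.exp((math.log(p1) + math.log(p2)) / 2)  # geometric mean
print(round(individual_bleu2, 4), round(cumulative_bleu2, 4))
```

The cumulative score is higher here because the unigram precision (5/6) pulls up the weaker bigram precision (3/5).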

In conclusion, while BLEU is a useful tool for evaluating translation efficiency, it should be combined with diverse analytical metrics and human evaluation for complete translation quality understanding.
