Google proposes BLEURT, a metric that quantifies the performance of NLG models

Over the past few years, natural language generation (NLG) models have evolved dramatically, with marked improvements in reading comprehension, text summarization, and dialogue. To make it easier for research teams to evaluate the performance of different NLG models, search giant Google has created a metric called BLEURT. Typically, NLG models are evaluated either manually or automatically, for example with the Bilingual Evaluation Understudy (BLEU) metric. The drawback of the former is that it is labor-intensive; the drawback of the latter is its limited accuracy.

According to Google researchers, BLEURT is a new automated evaluation metric for NLG models that provides reliable ratings across different models, with results that they report come close to, or even surpass, those of human evaluation.

It is reported that machine learning lies at the core of BLEURT, and the most important ingredient for any ML model is how rich its training data is. For NLG evaluation, however, training data is quite limited.

In fact, the WMT Metrics Shared Task dataset, the largest existing collection of human ratings, contains only about 260,000 ratings, all drawn from the news domain.

Used as the sole training dataset, the WMT Metrics Task data would leave the trained model short on versatility and robustness. To solve this problem, the researchers turned to transfer learning.

First, the team used the contextual word representations of BERT, which have already been successfully incorporated into NLG metrics such as YiSi and BERTscore.
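
For context, this is roughly how such BERT-based metrics are used in practice. The sketch below calls the public bert-score package, which is separate from the BLEURT release itself; the sentences are made up for illustration.

```python
# Illustrative use of BERTscore, one of the BERT-based metrics named above.
# Requires the public `bert-score` package: pip install bert-score
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Tokens from each candidate/reference pair are embedded with a pretrained
# contextual encoder and matched by cosine similarity; P, R, F1 come back
# as tensors with one value per pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTscore F1: {F1.item():.4f}")
```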

Next, the researchers introduced a novel pre-training scheme to improve BLEURT’s robustness and accuracy while helping it cope with quality drift in the models being evaluated.

Before fine-tuning on human ratings, BLEURT is “warmed up” on millions of synthetic sentence pairs. These are generated by taking sentences from Wikipedia and applying random perturbations to them.
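
As a rough illustration of this idea, and not the paper’s exact procedure, here is a minimal sketch that perturbs a sentence by randomly dropping words:

```python
import random

def perturb(sentence: str, drop_prob: float = 0.15) -> str:
    """Create a synthetic 'candidate' sentence by randomly dropping words.

    A deliberately simple stand-in for the richer perturbations used to
    build BLEURT's synthetic pre-training pairs.
    """
    words = sentence.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else sentence  # avoid empty output

reference = "BLEURT is warmed up on millions of synthetic sentence pairs."
candidate = perturb(reference)
print(reference)
print(candidate)
```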

Instead of scoring these pairs manually, the team labeled them with a collection of metrics and models from the literature, including BLEU, which allowed the number of training examples to be scaled up at very low cost; BLEURT was then pre-trained in two phases.
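
A minimal sketch of that labeling step follows. It scores each synthetic pair with BLEU via the sacrebleu package, standing in for the broader collection of signals the team actually used:

```python
# Sketch: label synthetic (reference, candidate) pairs with an existing
# automatic metric instead of human ratings. BLEU via `sacrebleu` stands
# in for the full set of signals used in practice.
# pip install sacrebleu
import sacrebleu

pairs = [
    ("The quick brown fox jumps over the lazy dog.",
     "The quick fox jumps over the dog."),
    ("BLEURT is pre-trained on synthetic data.",
     "BLEURT is trained on data."),
]

for reference, candidate in pairs:
    bleu = sacrebleu.sentence_bleu(candidate, [reference])
    # Each (candidate, reference, score) triple becomes a cheap
    # pre-training example for the metric model.
    print(f"{bleu.score:6.2f}  {candidate}")
```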

The goal of the first phase is language modeling; the goal of the second is learning to evaluate NLG output. After that, the team fine-tuned the model on the WMT Metrics dataset. Once trained, BLEURT was benchmarked against competing approaches to demonstrate its advantage over current metrics.

BLEURT is known to run on Python 3 and relies on TensorFlow, as detailed on the GitHub project page. For more information on this study, you can read the preprint on arXiv.
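
For readers who want to try it, the repository documents a small Python API along the following lines; the checkpoint path below is illustrative and must point to a checkpoint downloaded from the project page:

```python
# Minimal usage sketch based on the bleurt repository's documented API.
# pip install git+https://github.com/google-research/bleurt.git
from bleurt import score

checkpoint = "bleurt/test_checkpoint"  # illustrative; use a downloaded checkpoint
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
# Returns one float per (reference, candidate) pair; higher means the
# candidate is judged closer to the reference.
scores = scorer.score(references=references, candidates=candidates)
print(scores)
```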

Finally, the researchers summarized other findings, such as BLEURT’s ability to “capture NLG quality beyond surface overlap”; the metric achieved state-of-the-art (SOTA) performance on two academic benchmarks.