CL, Nov 21, 2023

Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks

Amazon
arXiv:2311.12534v1, 1 citation, h-index: 48
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation metrics in synthetic traffic generation, a task that is crucial for training QA systems and conversational agents. The advance is incremental: it extends existing work on NLG evaluation rather than opening a new line of research.

The paper shows that common NLG metrics such as BLEU are unsuitable for evaluating synthetic traffic generation, and proposes new metrics that improve agreement with human judgment by up to 20% across three tasks.

Many Natural Language Generation (NLG) tasks aim to generate a single output text given an input prompt. Other settings require the generation of multiple texts, e.g., for Synthetic Traffic Generation (STG). This generation task is crucial for training and evaluating QA systems as well as conversational agents, where the goal is to generate multiple questions or utterances resembling the linguistic variability of real users. In this paper, we show that common NLG metrics, like BLEU, are not suitable for evaluating STG. We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts. We validate our metrics with an automatic procedure to verify whether they capture different types of quality issues of generated data; we also run human annotations to verify the correlation with human judgements. Experiments on three tasks, i.e., Shopping Utterance Generation, Product Question Generation and Query Auto Completion, demonstrate that our metrics are effective for evaluating STG tasks, and improve the agreement with human judgement by up to 20% with respect to common NLG metrics. We believe these findings can pave the way towards better solutions for estimating the representativeness of synthetic text data.
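
To make the idea of comparing generated traffic to the distribution of real user texts concrete, here is a minimal sketch. It is not one of the paper's metrics: it assumes whitespace tokenization, unigram frequencies, and Jensen-Shannon divergence as the distance, and the function names and example utterances are hypothetical. It illustrates why a set-level comparison can flag a generator that keeps repeating one plausible utterance, a failure that a per-sample overlap score might not penalize.

```python
import math
from collections import Counter


def ngram_distribution(texts, n=1):
    """Relative n-gram frequencies over a whole set of texts (assumed whitespace tokens)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of observed n-grams.
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in keys}

    def kl(a):
        # KL(a || m); m[g] > 0 whenever a[g] > 0, so the ratio is well defined.
        return sum(pa * math.log2(pa / m[g]) for g, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical data: real user utterances and two synthetic sets.
real = ["where is my order", "track my package", "cancel my order"]
faithful = ["track my package", "cancel my order", "where is my order"]
collapsed = ["where is my order"] * 3  # fluent, but no linguistic variability

real_dist = ngram_distribution(real)
print("faithful :", js_divergence(real_dist, ngram_distribution(faithful)))
print("collapsed:", js_divergence(real_dist, ngram_distribution(collapsed)))
```

Lower values mean the synthetic set is closer to the real traffic under these assumptions. The collapsed set scores worse than the faithful one even though every utterance in it is a verbatim real utterance.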

