CL | Mar 1, 2024

Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores

Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, Ani Nenkova

arXiv:2403.00553v219.867 citations | h-index: 52 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the lack of a standard way to measure text diversity, a problem relevant to NLP researchers and practitioners. The contribution is incremental, since it builds on existing scores and tools.

The paper tackles the lack of a standard method for measuring text diversity in LLM outputs by empirically comparing existing scores and releasing an open-source tool. It finds that fast compression algorithms capture information similar to slow-to-compute n-gram scores, and it identifies a set of measures with low mutual correlation that are sufficient to report.

The diversity across outputs generated by LLMs shapes perception of their quality and utility. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and "canned" responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and we release diversity, an open-source Python package for measuring and extracting repetition in text. We also build a platform based on diversity for users to interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures (compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore) is sufficient to report, as these measures have low mutual correlation with each other.
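To make two of the measures named in the abstract concrete, here is a minimal sketch using only the Python standard library. It does not use the API of the released diversity package, and the function names (`compression_ratio`, `self_repetition`) and exact definitions are illustrative assumptions: the compression ratio is taken as raw size over gzip-compressed size of the concatenated corpus, and self-repetition is simplified to the fraction of documents that share at least one whitespace-token n-gram with another document. The paper's own definitions may differ in detail.

```python
import gzip
from collections import Counter


def compression_ratio(docs):
    """Raw byte size divided by gzip-compressed size of the joined corpus.

    Assumed definition for illustration: higher values indicate more
    redundancy across documents, i.e. lower diversity.
    """
    raw = "\n".join(docs).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def self_repetition(docs, n=4):
    """Fraction of documents sharing at least one n-gram with another document.

    Simplified, hypothetical variant of a self-repetition score:
    tokens are whitespace-separated and each n-gram is counted once per document.
    """
    if not docs:
        return 0.0
    per_doc = [set(ngrams(doc.split(), n)) for doc in docs]
    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter(gram for grams in per_doc for gram in grams)
    repeated = sum(any(doc_freq[g] > 1 for g in grams) for grams in per_doc)
    return repeated / len(docs)
```

Under these assumptions, a set of responses that all open with the same templated phrase should yield a higher compression ratio and a higher self-repetition score than a set of unrelated sentences. The compression-based score needs only one pass over the corpus, which is consistent with the abstract's description of it as the fast alternative to n-gram overlap scores.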
