CL · Mar 19, 2020

Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

arXiv:2003.08529v1998 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in natural language processing by providing characteristic metrics for text collections, though it is incremental in that it builds on existing statistical concepts.

The authors address the shortage of quantitative metrics for describing text collections by proposing three: diversity, density, and homogeneity, which measure dispersion, sparsity, and uniformity respectively. Simulations check that each metric behaves as intended, and experiments on real-world datasets show the metrics are highly correlated with BERT's text classification performance, which the authors suggest could inspire future applications.

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.
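To make the idea of a dispersion measure concrete, here is a minimal sketch that scores a collection of text embeddings by the geometric mean of their per-dimension standard deviations. This is an illustrative proxy only: the abstract does not give the paper's formulas, so the function name `dispersion`, the choice of statistic, and the synthetic embeddings are all assumptions rather than the authors' definition of diversity.

```python
import numpy as np


def dispersion(embeddings: np.ndarray) -> float:
    """Geometric mean of per-dimension standard deviations.

    Hypothetical proxy for how spread out a text collection is in
    embedding space; not the metric defined in the paper.
    """
    stds = embeddings.std(axis=0)
    # Average in log space so the product over many dimensions
    # neither underflows nor overflows.
    return float(np.exp(np.mean(np.log(stds + 1e-12))))


# Synthetic stand-ins for sentence embeddings (assumption: real usage
# would take vectors from an encoder such as BERT).
rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(200, 8))   # near-duplicate texts
spread = rng.normal(scale=1.0, size=(200, 8))  # varied texts

print(dispersion(tight), dispersion(spread))
```

Under this proxy the tightly clustered collection scores lower than the widely spread one, and rescaling every embedding by a constant rescales the score by the same constant.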

