CL · Aug 8, 2023

Collective Human Opinions in Semantic Textual Similarity

Yuxia Wang, Shimin Tao, Ning Xie, Hao Yang, Timothy Baldwin, Karin Verspoor

arXiv:2308.04114v121.5136 citations · h-index: 69 · Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses a foundational issue in NLP for researchers and practitioners by highlighting the limitations of existing STS benchmarks. The contribution is incremental, however, because it focuses on dataset creation and analysis rather than a new method.

The paper targets a weakness of semantic textual similarity (STS) benchmarks: they use averaged human ratings as the gold standard, which masks the true distribution of opinions and prevents models from capturing semantic vagueness. It introduces USTS, an uncertainty-aware dataset with ~15,000 Chinese sentence pairs and 150,000 labels, and shows that current STS models fail to capture the variance caused by human disagreement.

Despite the subjective nature of semantic textual similarity (STS) and pervasive disagreements in STS annotation, existing benchmarks have used averaged human ratings as the gold standard. Averaging masks the true distribution of human opinions on examples of low agreement, and prevents models from capturing the semantic vagueness that the individual ratings represent. In this work, we introduce USTS, the first Uncertainty-aware STS dataset with ~15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis reveals that neither a scalar nor a single Gaussian fits a set of observed judgements adequately. We further show that current STS models cannot capture the variance caused by human disagreement on individual instances, but rather reflect the predictive confidence over the aggregate dataset.
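To make the point about averaging concrete, here is a minimal sketch in Python. The ratings are invented for illustration and are not taken from USTS. The 0-5 scale and the ten ratings per pair are assumptions; ten per pair roughly matches the ratio of 150,000 labels to ~15,000 pairs. The helper name `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical annotator ratings on an assumed 0-5 similarity scale.
# These values are made up for illustration; they are not USTS data.
high_agreement = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
low_agreement = [1, 1, 1, 1, 1, 5, 5, 5, 5, 5]


def summarize(ratings):
    """Return (mean, population standard deviation) of a list of ratings."""
    return mean(ratings), pstdev(ratings)


for name, ratings in [("high agreement", high_agreement),
                      ("low agreement", low_agreement)]:
    mu, sigma = summarize(ratings)
    print(f"{name}: mean={mu:.2f}, std={sigma:.2f}")
```

Both pairs receive the same averaged label, so a benchmark that stores only the mean cannot tell them apart. The second pair is also bimodal: a single Gaussian fitted to it would be centred on a score that no annotator in this toy example actually gave, which is one way to picture why a scalar or a single Gaussian can be an inadequate summary.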
