CL · May 28

Metric-Dependent Annotation Saturation for Learning from Label Distributions

arXiv:2605.29797 · 71.6

Predicted impact: top 89% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For practitioners collecting annotations, this work suggests that annotation budgets should be set according to the target evaluation metric rather than uniformly.

The paper shows that the number of annotators needed to capture disagreement in label distributions depends on the evaluation metric: in a 3-class NLI setting, entropy correlation needs roughly 20-50 annotators to converge, while KL divergence saturates by about 10. On entropy correlation, soft labels outperform label smoothing (r = 0.643 vs. r ~ 0.45-0.49), and the advantage replicates across two architectures and an exploratory cross-domain evaluation.
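The contrast between soft labels and label smoothing can be illustrated with a minimal sketch. It assumes the standard smoothing formulation, (1 - eps) * one_hot + eps / K, which may differ from the paper's exact setup; the vote counts and function names are hypothetical and not taken from ChaosNLI.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def smooth_one_hot(label, num_classes, eps):
    """Standard label smoothing (assumed): (1 - eps) * one_hot + eps / K."""
    return [
        (1 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


def soft_label(counts):
    """Empirical label distribution from per-class annotator vote counts."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical vote counts over 3 NLI classes (illustrative only).
clear_item = [95, 3, 2]
ambiguous_item = [40, 35, 25]

for name, counts in [("clear", clear_item), ("ambiguous", ambiguous_item)]:
    majority = counts.index(max(counts))
    smoothed = smooth_one_hot(majority, num_classes=3, eps=0.1)
    soft = soft_label(counts)
    # The smoothed target has the same entropy for every item;
    # the soft label's entropy tracks how much annotators disagreed.
    print(name, round(entropy(smoothed), 3), round(entropy(soft), 3))
```

Under this formulation every smoothed target has identical entropy regardless of how much annotators disagreed, so the training target itself carries no item-specific ambiguity signal, whereas the soft label's entropy differs between the clear and the ambiguous item.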

When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.
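The subsampling setup and the two metrics can be sketched as follows. This is a simplified proxy, not the paper's pipeline: no model is fine-tuned, the votes are synthetic rather than ChaosNLI data, and the subsampled label distributions themselves are compared against the 100-vote reference. The KL direction (reference to subsample), the additive smoothing constant `alpha=0.5`, and all function names are assumptions made for illustration.

```python
import math
import random

NUM_CLASSES = 3
rng = random.Random(0)


def distribution(votes, alpha=0.0):
    """Label distribution from a list of class ids, with optional additive smoothing."""
    counts = [alpha] * NUM_CLASSES
    for v in votes:
        counts[v] += 1
    total = sum(counts)
    return [c / total for c in counts]


def entropy(p):
    """Shannon entropy (in nats)."""
    return -sum(q * math.log(q) for q in p if q > 0)


def kl_divergence(p, q):
    """KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has zero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def make_item(num_annotators=100):
    """Synthetic item: skewed class weights, so some items are near-unanimous."""
    weights = [rng.random() ** 3 + 1e-6 for _ in range(NUM_CLASSES)]
    return rng.choices(range(NUM_CLASSES), weights=weights, k=num_annotators)


items = [make_item() for _ in range(200)]
reference = [distribution(votes) for votes in items]


def evaluate(n):
    """Mean KL to the reference and entropy correlation when keeping n annotators."""
    kls, ref_h, sub_h = [], [], []
    for votes, ref in zip(items, reference):
        sample = rng.sample(votes, n)  # subsample annotators without replacement
        kls.append(kl_divergence(ref, distribution(sample, alpha=0.5)))
        ref_h.append(entropy(ref))
        sub_h.append(entropy(distribution(sample)))
    return sum(kls) / len(kls), pearson(ref_h, sub_h)


for n in (5, 10, 20, 50, 100):
    mean_kl, r = evaluate(n)
    print(f"N={n:3d}  mean KL={mean_kl:.4f}  entropy r={r:.3f}")
```

Running the loop shows how each metric changes as N grows on the synthetic data. The shape of those curves depends on the simulated vote distributions, so they should not be read as a reproduction of the saturation points reported above.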
