Understanding and Predicting Human Label Variation in Natural Language Inference through Explanation
This work addresses the need for robust and trusted NLP models by focusing on annotation disagreement, though its contribution is incremental, since it builds on prior work on human label variation.
The paper tackles human label variation in NLP by creating LiveNLI, a dataset of 122 English NLI items, each with at least 10 annotations that include highlights and free-text explanations. Using these explanations for chain-of-thought prompting, the authors find that GPT-3's ability to predict label distributions still has room for improvement.
Human label variation (Plank 2022), or annotation disagreement, exists in many natural language processing (NLP) tasks. To be robust and trusted, NLP models need to identify such variation and be able to explain it. To this end, we created the first ecologically valid explanation dataset with diverse reasoning, LiveNLI. LiveNLI contains annotators' highlights and free-text explanations for the label(s) of their choice for 122 English Natural Language Inference items, each with at least 10 annotations. We used its explanations for chain-of-thought prompting, and found there is still room for improvement in GPT-3's ability to predict label distribution with in-context learning.
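To make "label distribution" concrete, the following is a minimal sketch of how a per-item human label distribution could be derived from multiple annotations and compared with a model's predicted distribution. It is an illustration under stated assumptions, not the paper's procedure: the three NLI label names, the count-based normalization over all selected labels, the use of KL divergence as the comparison, and the example annotations and predicted values are all hypothetical.

```python
from collections import Counter
import math

# Assumed label set for NLI; the names are illustrative.
LABELS = ("entailment", "neutral", "contradiction")


def label_distribution(annotations):
    """Turn per-annotator label choices into a distribution over LABELS.

    `annotations` is a list with one entry per annotator; each entry is a
    list of the label(s) that annotator chose. Every selected label counts
    once, and counts are normalized by the total number of selections
    (an assumed normalization, chosen for simplicity).
    """
    counts = Counter(label for chosen in annotations for label in chosen)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        raise ValueError("no known labels found in annotations")
    return {label: counts[label] / total for label in LABELS}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over LABELS; q is floored at eps to avoid division by zero."""
    return sum(
        p[label] * math.log(p[label] / max(q[label], eps))
        for label in LABELS
        if p[label] > 0
    )


# Hypothetical item with 10 annotators, one of whom chose two labels.
annotations = (
    [["entailment"]] * 6
    + [["neutral"]] * 3
    + [["entailment", "neutral"]]
)
human = label_distribution(annotations)

# Hypothetical model output for the same item.
predicted = {"entailment": 0.9, "neutral": 0.05, "contradiction": 0.05}
divergence = kl_divergence(human, predicted)
```

A lower divergence means the predicted distribution is closer to the human one. Other distance measures could be substituted without changing the overall shape of the comparison.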