CL, Oct 18, 2023

From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification

Shanshan Xu, T. Y. S. S Santosh, Oana Ichim, Isabella Risini, Barbara Plank, Matthias Grabmair

arXiv:2310.11878v522.0138 citations, h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses human label variation in legal NLP, which matters for the trustworthiness and explainability of case outcome classification. The advance is incremental: it builds on existing explainable COC methods by moving from single-expert to multi-expert annotations.

The study examines disagreements among legal experts in constructing rationales for case outcome classification. It introduces a novel dataset, RAVE, and analyzes the sources of disagreement, finding that underspecification of the legal context is a major source and that state-of-the-art models show limited agreement with experts.

In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset, RAVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of SOTA COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.
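As a rough illustration of what "weak agreement" between two rationale annotators can mean, the sketch below computes Cohen's kappa over binary token-level rationale masks. This is an assumption made for illustration only: the abstract does not state which agreement measure the authors use, and the function name `cohen_kappa` and the toy masks are hypothetical, not taken from RAVE.

```python
from typing import List


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two equal-length label sequences.

    Here each label is assumed to mark whether a token was
    highlighted as part of the rationale (1) or not (0).
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of positions with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical masks over eight tokens of one case-fact paragraph.
expert_a = [1, 1, 0, 0, 1, 0, 0, 0]
expert_b = [1, 0, 0, 0, 1, 1, 0, 0]
kappa = cohen_kappa(expert_a, expert_b)
```

The two toy annotators agree on six of eight tokens, yet kappa is noticeably lower than that raw rate because much of the overlap would be expected by chance when most tokens are unhighlighted.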
