Legal Experts Disagree With Rationale Extraction Techniques for Explaining ECtHR Case Outcome Classification
This work addresses the need for trustworthy AI in legal applications by assessing interpretability methods, though its contribution is incremental: it compares existing techniques rather than introducing new ones.
The study evaluates interpretability methods for legal outcome prediction by comparing rationale extraction techniques on a new ECtHR dataset, and finds that the models' reasons for their predictions differ from legal experts' judgments despite high faithfulness scores.
Interpretability is critical for applications of large language models (LLMs) in the legal domain, where trust and transparency are essential. A central NLP task in this setting is legal outcome prediction, where models forecast whether a court will find a violation of a given right. We study this task on decisions from the European Court of Human Rights (ECtHR), introducing a new ECtHR dataset with carefully curated positive (violation) and negative (non-violation) cases. Existing work proposes both task-specific approaches and model-agnostic techniques to explain downstream performance, but it remains unclear which techniques best explain legal outcome prediction. To address this, we propose a comparative analysis framework for model-agnostic interpretability methods. We focus on two rationale extraction techniques that justify model outputs with concise, human-interpretable text fragments from the input. We evaluate faithfulness via normalized sufficiency and comprehensiveness metrics, and plausibility via legal expert judgments of the extracted rationales. We also assess the feasibility of LLM-as-a-Judge, with these expert evaluations as reference. Our experiments on the new ECtHR dataset show that models' "reasons" for predicting violations differ substantially from those of legal experts, despite strong faithfulness scores. The source code of our experiments is publicly available at https://github.com/trusthlt/IntEval.
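The abstract does not spell out how the normalized faithfulness metrics are computed. As an illustration only, the sketch below follows one normalization commonly used in the rationale-evaluation literature, in which raw sufficiency and comprehensiveness are rescaled by the probability drop observed when the model sees an empty (fully masked) input. The function name, argument names, and the choice of this particular normalization are assumptions and are not taken from the paper or its repository.

```python
from typing import Optional, Tuple


def _clip01(value: float) -> float:
    """Clamp a score to the interval [0, 1]."""
    return min(1.0, max(0.0, value))


def normalized_faithfulness(
    p_full: float,
    p_rationale_only: float,
    p_without_rationale: float,
    p_empty: float,
    eps: float = 1e-12,
) -> Optional[Tuple[float, float]]:
    """Return (normalized sufficiency, normalized comprehensiveness).

    All arguments are probabilities of the predicted class (hypothetical names):
      p_full              -- model sees the full input
      p_rationale_only    -- model sees only the extracted rationale
      p_without_rationale -- model sees the input with the rationale removed
      p_empty             -- model sees an empty / fully masked input (baseline)

    Returns None when the empty-input baseline causes no probability drop,
    since the normalization is undefined in that case.
    """
    # Raw sufficiency: 1 minus the drop when only the rationale is kept.
    suff = 1.0 - max(0.0, p_full - p_rationale_only)
    # Baseline sufficiency of an empty rationale.
    suff_null = 1.0 - max(0.0, p_full - p_empty)
    # Raw comprehensiveness: the drop when the rationale is removed.
    comp = max(0.0, p_full - p_without_rationale)

    denom = 1.0 - suff_null  # equals the drop caused by the empty input
    if denom <= eps:
        return None

    norm_suff = _clip01((suff - suff_null) / denom)
    norm_comp = _clip01(comp / denom)
    return norm_suff, norm_comp
```

For example, with `p_full=0.9`, `p_rationale_only=0.8`, `p_without_rationale=0.6`, and `p_empty=0.5`, both normalized scores come out at roughly 0.75: the rationale alone recovers about three quarters of the confidence lost on an empty input, and removing it causes about three quarters of that loss. Scores like these measure only how well a rationale tracks the model's behavior, which is why they can be high even when experts find the rationale implausible.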