LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights
The paper provides a new benchmark for assessing the legal reasoning capabilities of AI systems, specifically on European human rights law. The contribution is incremental, as it builds on existing legal evaluation frameworks.
The authors tackle the problem of evaluating legal reasoning in Large Language Models by creating a new task (LAR) and dataset (LAR-ECHR) based on European Court of Human Rights cases. Even the best model evaluated (GPT-4o) achieved only 75.8% accuracy, indicating room for improvement.
We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). Given the facts of a case, the task requires selecting the correct next statement, from multiple-choice options, in a chain of legal arguments from court proceedings. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on European human rights law, (b) LAR-ECHR distinguishes top models more clearly than LegalBench does, and (c) even the best model (GPT-4o) obtains only 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.
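To make the task format concrete, the following is a minimal Python sketch of how a LAR-style instance and its accuracy metric might be represented. The class name `LarInstance`, its field names, the `accuracy` helper, and the `always_first` baseline are all assumptions made for illustration. They are not the dataset's actual schema or the authors' evaluation code, and the instance texts are placeholders rather than real case content.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LarInstance:
    """One multiple-choice item (hypothetical schema, not the official one)."""
    facts: str                 # facts of the case
    preceding_arguments: str   # the chain of legal arguments so far
    options: Sequence[str]     # candidate next statements
    label: int                 # index of the correct option


def accuracy(instances: Sequence[LarInstance],
             predict: Callable[[LarInstance], int]) -> float:
    """Fraction of instances where the predicted option index matches the label."""
    if not instances:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for inst in instances if predict(inst) == inst.label)
    return correct / len(instances)


def always_first(instance: LarInstance) -> int:
    """Trivial baseline: always pick the first option."""
    return 0


# Placeholder items only; the text below is NOT taken from LAR-ECHR.
toy = [
    LarInstance("<facts A>", "<arguments A>", ["<opt 1>", "<opt 2>"], label=0),
    LarInstance("<facts B>", "<arguments B>", ["<opt 1>", "<opt 2>"], label=1),
]

baseline_accuracy = accuracy(toy, always_first)
```

In practice, `predict` would wrap a call to an LLM that is shown the facts, the preceding arguments, and the options, and returns the index of the option it selects. The same `accuracy` function would then score that model's selections.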