CL | May 22, 2025

GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

arXiv:2505.17267v3 | 11 citations | h-index: 29 | EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better benchmarks in legal AI, though its contribution is incremental because it builds on existing LLM-as-a-judge methods.

The authors address the evaluation of large language models (LLMs) on free-text legal reasoning and citations by introducing GreekBarBench, a benchmark based on the Greek Bar exams. They find that the best models outperform average expert scores but do not reach the 95th percentile of experts.

We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
