CL, AI · Apr 29, 2025

Automatic Legal Writing Evaluation of LLMs

Ramon Pires, Roseval Malaquias Junior, Rodrigo Nogueira

arXiv:2504.21202v1 · 16.310 citations · h-index: 5 · Has Code · ICAIL

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of evaluating domain-specific legal writing, which matters to researchers and practitioners, though its contribution is incremental: it applies existing methods to a new dataset.

The authors address the scarcity of benchmarks for evaluating LLM legal writing by introducing oab-bench, a dataset based on the Brazilian Bar Examination. Claude-3.5 Sonnet achieved the best average score, 7.93 out of 10, and passed all 21 exams. As automated evaluators, frontier models such as OpenAI's o1 correlated strongly with human scores on approved exams.

Despite the recent advances in Large Language Models, benchmarks for evaluating legal writing remain scarce due to the inherent complexity of assessing open-ended responses in this domain. One of the key challenges in evaluating language models on domain-specific tasks is finding test datasets that are public, frequently updated, and contain comprehensive evaluation guidelines. The Brazilian Bar Examination meets these requirements. We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. The benchmark includes comprehensive evaluation guidelines and reference materials used by human examiners to ensure consistent grading. We evaluate the performance of four LLMs on oab-bench, finding that Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. We also investigate whether LLMs can serve as reliable automated judges for evaluating legal writing. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams, suggesting their potential as reliable automated evaluators despite the inherently subjective nature of legal writing assessment. The source code and the benchmark -- containing questions, evaluation guidelines, model-generated responses, and their respective automated evaluations -- are publicly available.
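The judge-versus-human comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say which correlation statistic the authors use, so Pearson's r is an assumption here, and mean absolute error is added only as a complementary check. The function names and the scores in the usage example are hypothetical and are not taken from oab-bench.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: the abstract only says "strong correlation"; Pearson's r
    is used here for illustration, not because the paper confirms it.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_absolute_error(xs, ys):
    """Average absolute gap between judge and human scores."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # Hypothetical scores on a 0-10 scale, made up for illustration only.
    human_scores = [6.5, 7.0, 8.2, 6.0, 9.1]
    judge_scores = [6.0, 7.5, 8.0, 6.4, 8.8]
    print(f"Pearson r: {pearson(human_scores, judge_scores):.3f}")
    print(f"MAE: {mean_absolute_error(human_scores, judge_scores):.3f}")
```

Reporting an error measure next to the correlation is a deliberate choice: a judge can rank exams in the same order as human examiners (high r) while still being systematically too lenient or too strict, which only the absolute gap reveals.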
