AI | Aug 5, 2025

ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts

Shuang Liu, Zelong Li, Ruoyun Ma, Haiyan Zhao, Mengnan Du

arXiv:2508.03080v1 | 5 citations | h-index: 17 | Has Code | Proceedings of the Natural Legal Language Processing Workshop 2025

Originality: Incremental advance

AI Analysis

The paper provides a benchmark for developing legal-domain LLMs and addresses the need for data confidentiality in legal risk analysis. The advance is incremental because it builds on existing datasets and evaluation methods.

This paper evaluates whether open-source large language models can match proprietary models at identifying legal risks in commercial contracts. It finds that proprietary models generally outperform open-source ones in correctness and effectiveness, though some open-source models are competitive in specific areas.

The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored. In response to growing interest in locally deploying open-source LLMs for legal tasks while preserving data confidentiality, this paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts. Using the Contract Understanding Atticus Dataset (CUAD), we assess 4 proprietary and 15 open-source LLMs. Our results highlight five key findings: (1) Proprietary models outperform open-source models in both correctness and output effectiveness, though some open-source models are competitive in certain specific dimensions. (2) Larger open-source models generally perform better, though the improvement slows down as models get bigger. (3) Reasoning ("thinking") mode improves output effectiveness but reduces correctness, likely due to over-complicating simpler tasks. (4) Open-source models generate "no related clause" responses more frequently even when relevant clauses are present. This suggests "laziness" in thinking or low confidence in extracting relevant content. (5) Model quantization speeds up inference but at the cost of a performance drop, showing the tradeoff between efficiency and accuracy. These findings suggest that while most LLMs perform at a level comparable to junior legal assistants, open-source models require targeted fine-tuning to ensure correctness and effectiveness in high-stakes legal settings. ContractEval offers a solid benchmark to guide future development of legal-domain LLMs.
