CL · AI · Nov 16, 2023

BLT: Can Large Language Models Handle Basic Legal Text?

Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme

arXiv:2311.09693v39.426 citations · h-index: 60 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the reliability of LLMs for legal professionals, but it is incremental: it focuses on a specific benchmark and a fine-tuning approach.

The paper shows that large language models (LLMs) perform poorly on basic legal text tasks, such as looking up a specific line in a document, and finds that fine-tuning a small model on the authors' training set achieves near-perfect performance.

We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs' poor performance on this benchmark casts into doubt their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs' reliability as-is for basic legal tasks.
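To make the kind of task described above concrete, here is a minimal sketch of a line-lookup example and an exact-match scorer. It is not the authors' benchmark code: the transcript text, the prompt wording, and the names `make_lookup_example`, `exact_match`, `accuracy`, and `oracle` are all assumptions made for illustration, and `oracle` is a stand-in where a real LLM call would go.

```python
import random


def make_lookup_example(lines, rng):
    """Build one hypothetical line-lookup example from transcript lines."""
    # Number each line from 1, as on a deposition transcript page.
    numbered = [f"{i}: {text}" for i, text in enumerate(lines, start=1)]
    target = rng.randrange(1, len(lines) + 1)
    prompt = "\n".join(numbered) + f"\n\nWhat is the exact text of line {target}?"
    return {"prompt": prompt, "answer": lines[target - 1], "line": target}


def exact_match(prediction, answer):
    """True only if prediction equals answer after trimming whitespace."""
    return prediction.strip() == answer.strip()


def accuracy(examples, predict):
    """Fraction of examples where predict(prompt) returns the exact answer."""
    correct = sum(exact_match(predict(ex["prompt"]), ex["answer"]) for ex in examples)
    return correct / len(examples)


def oracle(prompt):
    """Stand-in for a model: parses the prompt and looks the line up directly."""
    body, question = prompt.rsplit("\n\n", 1)
    target = int(question.rstrip("?").split()[-1])
    for row in body.split("\n"):
        number, text = row.split(": ", 1)
        if int(number) == target:
            return text
    return ""


# Invented transcript lines, used only for illustration.
transcript = [
    "Q. Please state your name for the record.",
    "A. Jane Doe.",
    "Q. Where were you employed in 2019?",
    "A. At a shipping company.",
    "Q. Did you sign the contract?",
    "A. Yes, I did.",
]

rng = random.Random(0)
examples = [make_lookup_example(transcript, rng) for _ in range(5)]
print(accuracy(examples, oracle))
```

Replacing `oracle` with a function that sends the prompt to a model and returns its reply would give a zero-shot accuracy figure on this toy task. Exact match is one plausible scoring choice for lookup tasks, since a near-miss on a quoted line is still a wrong quotation.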
