CL · AI · Nov 16, 2023

BLT: Can Large Language Models Handle Basic Legal Text?

arXiv:2311.09693v3 · 26 citations · h-index: 60
Originality: Synthesis-oriented
AI Analysis

The paper addresses the reliability of LLMs for legal professionals, but its contribution is incremental: it focuses on a specific benchmark and fine-tuning approach.

The paper shows that large language models (LLMs) perform poorly on basic legal text tasks, such as looking up a specific line in a document, and finds that fine-tuning a small model on the benchmark's training set achieves near-perfect performance.

We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs' poor performance on this benchmark casts into doubt their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs' reliability as-is for basic legal tasks.
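To make the task concrete, here is a minimal sketch of what a line-lookup example of the kind described above might look like. It is not the authors' code: the synthetic transcript, the prompt wording, the function names, and the exact-match scoring rule are all assumptions made for illustration.

```python
import random


def make_transcript(num_lines, seed=0):
    """Build a synthetic line-numbered transcript.

    Hypothetical stand-in for a deposition page; real benchmark data
    would come from actual or more realistic legal text.
    """
    rng = random.Random(seed)
    words = ["yes", "no", "contract", "signed", "recall",
             "meeting", "invoice", "tuesday"]
    return {
        n: " ".join(rng.choice(words) for _ in range(5))
        for n in range(1, num_lines + 1)
    }


def make_lookup_example(lines, target):
    """Return a (prompt, expected_answer) pair asking for the text of one line."""
    body = "\n".join(f"{n}: {text}" for n, text in lines.items())
    prompt = f"{body}\n\nWhat is the exact text of line {target}?"
    return prompt, lines[target]


def exact_match(prediction, expected):
    """Score by exact match after trimming surrounding whitespace (assumed metric)."""
    return prediction.strip() == expected


lines = make_transcript(25)
prompt, expected = make_lookup_example(lines, target=12)
```

A model's reply to `prompt` would be passed to `exact_match` together with `expected`. Exact match is a deliberately strict choice here, since quoting the wrong line of a deposition or contract is an error regardless of how close the text is.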

Code Implementations: 1 repo
