CL | Apr 20

LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

arXiv:2604.18878 | 11.6 | h-index: 2

AI Analysis

For researchers and practitioners in Brazilian legal NLP, this benchmark reveals that general-purpose LLMs cannot substitute for domain-adapted models even in simple classification tasks, and provides a reproducible baseline.

LegalBench-BR is the first public benchmark for Brazilian legal text classification, showing that fine-tuned BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming GPT-4o mini by 28pp and Claude 3.5 Haiku by 22pp, while commercial LLMs exhibit systematic bias toward civil law.

We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.
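The class-absorption failure described in the abstract can be illustrated with a small macro-F1 calculation. The sketch below is not the authors' evaluation pipeline: the helper names `per_class_f1` and `macro_f1` are hypothetical, and the eight toy labels are invented for illustration, using only the two class names the abstract mentions. It shows how a classifier that sends every administrativo case to civel gets F1 = 0 on administrativo while still scoring 50% accuracy on a balanced set.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from true and predicted label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy, class-balanced labels (invented for illustration, not the paper's data).
y_true = ["civel"] * 4 + ["administrativo"] * 4
# A biased classifier that absorbs every case into "civel".
y_pred = ["civel"] * 8

scores = per_class_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, macro_f1(y_true, y_pred), accuracy)
```

Because macro-F1 averages the per-class scores without weighting, a single absorbed class pulls the overall score down sharply even when accuracy looks moderate, which is why the abstract reports both metrics.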
