CL · AI · May 19

Chunking German Legal Code

Max Prior, Natalia Milanova, Andreas Schultz

arXiv:2605.19806 · 71.2

Predicted impact: top 89% in CL (last 90 days) · Originality: Synthesis-oriented

AI Analysis

For legal information retrieval practitioners, this paper provides empirical evidence that chunking along domain-specific structure is more effective than applying complex semantic methods.

This paper evaluates chunking strategies for retrieval-augmented generation on German statutory law, finding that structure-aligned methods (section and subsection chunking) achieve the highest recall while being more efficient than complex LLM-based approaches.

This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure, particularly section- and subsection-based retrieval, achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.
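To make the structure-aligned approach concrete, the following is a minimal sketch of section- and subsection-level chunking with a section-level recall measure. It is not the authors' implementation. The regular expressions, the `Chunk` type, the function names, and the placeholder sample text are all assumptions made for illustration; they presume that each section starts on its own line with `§ <number>` and that subsections start a line with `(1)`, `(2)`, and so on.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    section: str      # e.g. "§ 1"
    subsection: str   # e.g. "(1)", or "" if the section has no numbered subsections
    text: str


# Assumed layout: a section heading line starts with "§ <number>".
SECTION_RE = re.compile(r"^§\s*(\d+[a-z]?)\b.*$", re.MULTILINE)
# Assumed layout: a subsection starts a line with "(<number>)".
SUBSECTION_RE = re.compile(r"^\((\d+)\)\s*", re.MULTILINE)


def split_sections(corpus: str) -> list[tuple[str, str]]:
    """Return (section id, body) pairs; the heading line is excluded from the body."""
    matches = list(SECTION_RE.finditer(corpus))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(corpus)
        sections.append((f"§ {m.group(1)}", corpus[m.end():end].strip()))
    return sections


def subsection_chunks(corpus: str) -> list[Chunk]:
    """One chunk per numbered subsection; a whole-section chunk if none are numbered."""
    chunks = []
    for section_id, body in split_sections(corpus):
        marks = list(SUBSECTION_RE.finditer(body))
        if not marks:
            chunks.append(Chunk(section_id, "", body))
            continue
        # Note: any text before the first "(n)" marker is dropped in this sketch.
        for i, m in enumerate(marks):
            end = marks[i + 1].start() if i + 1 < len(marks) else len(body)
            chunks.append(Chunk(section_id, f"({m.group(1)})", body[m.end():end].strip()))
    return chunks


def section_recall_at_k(retrieved: list[Chunk], gold_sections: set[str], k: int) -> float:
    """Fraction of gold sections covered by the parent sections of the top-k chunks."""
    if not gold_sections:
        return 0.0
    hits = {c.section for c in retrieved[:k]} & gold_sections
    return len(hits) / len(gold_sections)


# Placeholder text, not actual statutory wording.
SAMPLE = (
    "§ 1 Example heading\n"
    "(1) First placeholder paragraph.\n"
    "(2) Second placeholder paragraph.\n"
    "§ 2 Another heading\n"
    "Single placeholder paragraph without numbered subsections.\n"
)

chunks = subsection_chunks(SAMPLE)
```

Because every chunk keeps its parent section identifier, a subsection-level hit can be scored against section-level gold labels without any extra lookup. The ranking passed to `section_recall_at_k` would come from whatever retriever is being evaluated; none is included here.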
