Legal Mathematical Reasoning with LLMs: Procedural Alignment through Two-Stage Reinforcement Learning
This work addresses the need for procedurally compliant reasoning in high-stakes legal applications, although the contribution is incremental because it builds on existing reinforcement learning methods.
The authors address legal mathematical reasoning in LLMs by introducing LexNum, a Chinese benchmark, and LexPam, a two-stage reinforcement learning framework; in their experiments, LexPam improved both mathematical accuracy and legal coherence.
Legal mathematical reasoning is essential for applying large language models (LLMs) in high-stakes legal contexts, where outputs must be both mathematically accurate and procedurally compliant. However, existing legal LLMs lack structured numerical reasoning, and open-domain models, though capable of calculations, often overlook mandatory legal steps. To address this, we present LexNum, the first Chinese legal mathematical reasoning benchmark, covering three representative scenarios where each instance reflects legally grounded procedural flows. We further propose LexPam, a two-stage reinforcement learning framework for efficient legal reasoning training. Leveraging curriculum learning, we use a stronger teacher model to partition data into basic and challenging subsets. A lightweight 1.5B student model is then fine-tuned with Group Relative Policy Optimization, which avoids costly value networks and enables stable training from sparse, end-of-sequence rewards. The first stage improves accuracy and format; the second introduces a novel reward to guide procedural alignment via task-specific legal elements. Experiments show that existing models perform poorly on LexNum, while LexPam enhances both mathematical accuracy and legal coherence, and generalizes effectively across tasks and domains.
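To make the training recipe more concrete, the following is a minimal sketch of two ingredients mentioned above: the teacher-based curriculum split and the group-relative advantage that lets Group Relative Policy Optimization work without a value network. It is an illustration under stated assumptions, not the authors' implementation. The function names `partition_by_teacher` and `group_relative_advantages` are hypothetical, the rule "teacher answers correctly means basic" is an assumed reading of the curriculum split, and the population standard deviation is one common normalization choice among several.

```python
from statistics import mean, pstdev


def partition_by_teacher(examples, teacher_is_correct):
    """Split examples into (basic, challenging) subsets.

    Assumption: an example counts as "basic" when the stronger teacher
    model solves it, and "challenging" otherwise. `teacher_is_correct`
    is a hypothetical callable that returns True or False per example.
    """
    basic, challenging = [], []
    for example in examples:
        if teacher_is_correct(example):
            basic.append(example)
        else:
            challenging.append(example)
    return basic, challenging


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize end-of-sequence rewards within one sampled group.

    Each reward belongs to one response sampled for the same prompt.
    The group mean acts as the baseline, so no learned value network
    is needed. `eps` avoids division by zero when all rewards match.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population standard deviation (assumed choice)
    return [(reward - baseline) / (spread + eps) for reward in rewards]
```

For example, a group of four responses with rewards `[1.0, 0.0, 0.0, 1.0]` yields advantages close to `[1, -1, -1, 1]`, so correct responses are reinforced and incorrect ones are penalized relative to their own group. When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.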