Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?
GRPO with an embedding-based reward provides an effective and resource-efficient way to enhance Thai legal LLMs, addressing a domain-specific bottleneck in legal reasoning and question answering.
The paper tackles the limited performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering, especially for complex reasoning, by applying Group-Relative Policy Optimization (GRPO) with a semantic-similarity reward computed from BGE-M3 embeddings. The result is up to 90% citation-F1 gains over the base model and a 31% increase in joint quality metrics over instruction tuning, with reward computation up to 2.5x cheaper than using large language model judges.
The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach that aligns LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, reducing computational expenses by up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains over the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method is more robust than instruction tuning on complex legal reasoning tasks, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
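To make the moving parts of the abstract concrete, the following is a minimal sketch of three ingredients it names: a citation F1 score, a cosine-similarity reward over embedding vectors, and the group-relative advantage that gives GRPO its name. It is an illustration under stated assumptions, not the paper's implementation. The function names are hypothetical, the embeddings are assumed to be precomputed (for example by BGE-M3) and passed in as plain lists of floats, and the advantage uses the commonly cited GRPO form of standardizing each reward against the other responses sampled for the same prompt. How the paper weights or combines its reward terms is not stated in this section, so no combination is shown.

```python
import math
from statistics import mean, pstdev


def citation_f1(predicted, gold):
    """F1 between the set of cited law sections and the gold set.

    Assumption: citations are compared as exact-match identifiers.
    """
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors.

    Assumption: u and v are precomputed embeddings (e.g. from BGE-M3)
    of the generated answer and the reference answer.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same question.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, responses that score above the group average are reinforced and those below are discouraged, without a separately trained value model. A cheap embedding-based reward fits this setup well, since every sampled response in every group has to be scored.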