LG · AI · May 8

Interactive Critique-Revision Training for Reliable Structured LLM Generation

Fei Xu Yu, Zuyuan Zhang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan

arXiv:2605.08327 · 88.8

AI Analysis

For practitioners needing auditable LLM outputs in structured decision-making (e.g., form filling, compliance), this method provides a principled training framework to improve local correctness and global consistency, though improvements are demonstrated on a single benchmark.

The paper introduces DPA-GRPO, a paired-action training method for a generator-verifier game that improves structured LLM generation reliability. On TaxCalcBench TY24, DPA-GRPO with Qwen3-4B/8B achieves higher structured decision accuracy than zero-shot and generator-only RL baselines, increasing correct silent acceptance and reducing missed errors.

In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, LLM outputs must be locally correct, globally consistent, and auditable against task-specific rules. Existing refinement methods often rely on heuristic debate, self-play, or LLM-generated supervision, creating a second-order assurance problem. We propose DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization), a paired-action training method for a two-player generator-verifier game with structured verifier interventions. The generator proposes outputs and may revise them when challenged; the verifier either remains silent or raises a safety assurance case (SAC) containing a claim, argument, and evidence. These SAC/no-SAC and KEEP/REVISE decisions induce paired counterfactual action groups, which DPA-GRPO uses for role-specific KL-regularized GRPO updates. We analyze the unregularized game and show that positive probability on strictly lower-reward intervention or revision actions creates a profitable unilateral deviation. Under standard stochastic-approximation assumptions, DPA-GRPO tracks the corresponding game ODE, whose isolated asymptotically stable limit points are stationary and candidate local equilibria under role-wise local optimality. Experiments on TaxCalcBench TY24 show that DPA-GRPO improves structured decision accuracy over zero-shot generation and generator-only RL baselines across Qwen3-4B and Qwen3-8B. Training increases correct silent acceptance, reduces missed errors, and improves calibrated revision behavior, indicating gains for both generator and verifier.
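The paired counterfactual action groups described in the abstract can be pictured with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation), and the reward values, the action labels as dictionary keys, and the helper name `group_relative_advantages` are all hypothetical. The KL-regularization term mentioned in the abstract is left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each action against the other actions in its own group.

    Assumption: standard GRPO-style normalization; the paper's exact
    advantage definition may differ.
    """
    values = list(rewards.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {action: (r - mu) / (sigma + eps) for action, r in rewards.items()}


# Hypothetical rewards for one generator output that contains an error:
# raising a SAC is the better verifier action, staying silent misses the error.
verifier_group = {"SAC": 1.0, "NO_SAC": -1.0}

# Hypothetical rewards for the generator's reply to that SAC:
# revising fixes the error, keeping the output does not.
generator_group = {"KEEP": -1.0, "REVISE": 1.0}

# Each role gets its own advantages from its own paired group.
verifier_adv = group_relative_advantages(verifier_group)
generator_adv = group_relative_advantages(generator_group)

print(verifier_adv)
print(generator_adv)
```

Under these assumed rewards, the higher-reward action in each pair gets a positive advantage and the lower-reward action a negative one, so a policy-gradient update for each role would shift probability away from the strictly worse action. If both actions in a pair earn the same reward, both advantages are zero and that pair contributes no update.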

