LG | AI | May 8

Interactive Critique-Revision Training for Reliable Structured LLM Generation

arXiv:2605.08327
AI Analysis

For practitioners needing auditable LLM outputs in structured decision-making (e.g., form filling, compliance), this method provides a principled training framework to improve local correctness and global consistency, though improvements are demonstrated on a single benchmark.

The paper introduces DPA-GRPO, a paired-action training method for a generator-verifier game that improves structured LLM generation reliability. On TaxCalcBench TY24, DPA-GRPO with Qwen3-4B/8B achieves higher structured decision accuracy than zero-shot and generator-only RL baselines, increasing correct silent acceptance and reducing missed errors.

In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, LLM outputs must be locally correct, globally consistent, and auditable against task-specific rules. Existing refinement methods often rely on heuristic debate, self-play, or LLM-generated supervision, creating a second-order assurance problem. We propose DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization), a paired-action training method for a two-player generator-verifier game with structured verifier interventions. The generator proposes outputs and may revise them when challenged; the verifier either remains silent or raises a safety assurance case (SAC) containing a claim, argument, and evidence. These SAC/no-SAC and KEEP/REVISE decisions induce paired counterfactual action groups, which DPA-GRPO uses for role-specific KL-regularized GRPO updates. We analyze the unregularized game and show that positive probability on strictly lower-reward intervention or revision actions creates a profitable unilateral deviation. Under standard stochastic-approximation assumptions, DPA-GRPO tracks the corresponding game ODE, whose isolated asymptotically stable limit points are stationary and candidate local equilibria under role-wise local optimality. Experiments on TaxCalcBench TY24 show that DPA-GRPO improves structured decision accuracy over zero-shot generation and generator-only RL baselines across Qwen3-4B and Qwen3-8B. Training increases correct silent acceptance, reduces missed errors, and improves calibrated revision behavior, indicating gains for both generator and verifier.
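The paired-action idea can be illustrated with a minimal sketch. It assumes the standard GRPO convention of standardizing rewards within a group, applied here to a group of two counterfactual actions per role. The reward values, the function names, and the `eps` constant are hypothetical, and the KL regularization term is omitted. The abstract does not specify the paper's actual estimator, so this is not it.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (GRPO-style convention)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps (an assumed constant) avoids division by zero when all rewards tie
    return [(r - mu) / (sigma + eps) for r in rewards]

def paired_advantages(pair):
    """Map each action in a two-action counterfactual group to its advantage."""
    actions = list(pair)
    advantages = group_relative_advantages([pair[a] for a in actions])
    return dict(zip(actions, advantages))

# Hypothetical rewards for one prompt: the generator's draft contains an
# error, so raising a SAC and revising are the higher-reward actions.
verifier_pair = {"SAC": 1.0, "NO_SAC": 0.0}
generator_pair = {"KEEP": 0.0, "REVISE": 1.0}

verifier_adv = paired_advantages(verifier_pair)
generator_adv = paired_advantages(generator_pair)

# When both actions earn the same reward, neither is preferred.
tied_adv = paired_advantages({"SAC": 0.5, "NO_SAC": 0.5})
```

In this sketch the higher-reward action in each pair receives a positive advantage and the lower-reward one a negative advantage of equal magnitude, so a role-specific policy-gradient step would shift probability away from the strictly lower-reward action.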
