REC-RL: Referring expression counting via Gaussian and range-based reward optimization
For researchers in vision-language reasoning, this work improves REC by explicitly optimizing the reasoning process without extra annotations, though it is an incremental improvement over existing methods.
The paper tackles referring expression counting (REC) by proposing a reinforcement learning framework (REC-RL) that optimizes intermediate reasoning via Gaussian and range-based rewards, achieving consistent improvements over strong baselines and robust generalization across benchmarks.
Referring expression counting (REC) is an intention-driven task that requires context-aware visual reasoning. While recent vision-language models incorporate language for visual understanding, most existing REC methods rely on rule-based reinforcement learning with rewards focused primarily on final accuracy, overlooking the quality of intermediate reasoning. We propose REC-RL, a reinforcement learning framework that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. REC-RL employs Group Relative Policy Optimization and two lightweight rewards: an accuracy reward that combines range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs. By modeling intermediate focus prediction as internal decision-making, REC-RL avoids additional annotations and better aligns with human perception. Extensive experiments demonstrate consistent improvements over strong baselines and robust generalization across benchmarks.
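The abstract does not give the exact reward formulas, so the following is only a minimal sketch of how the two rewards and the group-relative advantage could fit together. The tag layout (`<think>`, `<range>`, `<answer>`), the binary range check, the Gaussian width `sigma`, the mixing `weight`, and all function names are assumptions made for illustration, not the authors' implementation. The advantage step follows the standard Group Relative Policy Optimization normalization by group mean and standard deviation.

```python
import math
import re
from statistics import mean, pstdev

# Hypothetical think-range-answer layout; the source does not specify the tags.
PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*"
    r"<range>\s*(\d+)\s*-\s*(\d+)\s*</range>\s*"
    r"<answer>\s*(\d+)\s*</answer>\s*$",
    re.DOTALL,
)


def parse(output):
    """Return (low, high, answer) if the output follows the assumed layout, else None."""
    m = PATTERN.match(output)
    if m is None:
        return None
    return int(m.group(2)), int(m.group(3)), int(m.group(4))


def format_reward(output):
    """1.0 when the output is well-formed under the assumed layout, else 0.0."""
    return 1.0 if parse(output) is not None else 0.0


def range_reward(low, high, target):
    """Assumed interval supervision: 1.0 if the true count lies in [low, high]."""
    return 1.0 if low <= target <= high else 0.0


def gaussian_reward(pred, target, sigma=1.0):
    """Assumed precision guidance: decays smoothly with the counting error."""
    return math.exp(-((pred - target) ** 2) / (2 * sigma ** 2))


def accuracy_reward(output, target, weight=0.5, sigma=1.0):
    """Assumed convex mix of the range term and the Gaussian term."""
    parsed = parse(output)
    if parsed is None:
        return 0.0
    low, high, answer = parsed
    return (weight * range_reward(low, high, target)
            + (1 - weight) * gaussian_reward(answer, target, sigma))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


if __name__ == "__main__":
    target = 3
    group = [
        "<think>two dogs left, one right</think><range>2-4</range><answer>3</answer>",
        "<think>several dogs</think><range>5-8</range><answer>6</answer>",
        "3",  # malformed: no structure
    ]
    rewards = [accuracy_reward(o, target) + format_reward(o) for o in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch a very wide range always earns the range term, so a practical reward would likely also account for interval width; the abstract does not say how REC-RL handles this.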