Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
This work addresses the scarcity of annotated data for dialogue grounding. The contribution is incremental: it builds on existing methods to improve data synthesis.
The paper tackles the challenge of distribution shift and data scarcity in Dialogue-Based Generalized Referring Expression Comprehension by introducing a three-tier data-synthesis framework, resulting in consistent and substantial improvements over prior approaches across standard evaluation metrics.
Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground expressions that may refer to an unrestricted number of targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.