ReCQR: Incorporating conversational query rewriting to improve Multimodal Image Retrieval
This work addresses the challenge of handling ambiguous user expressions in multimodal image retrieval systems, though it appears incremental, since it adapts existing query rewriting techniques to a new domain.
The paper tackles the difficulty image retrievers have with long or unclear natural language queries. It introduces conversational query rewriting (CQR), which generates concise, semantically complete queries from dialogue histories, and contributes a curated dataset of approximately 7,000 multimodal dialogues. The authors report that CQR significantly improves retrieval accuracy.
With the rise of multimodal learning, image retrieval plays a crucial role in connecting visual information with natural language queries. Existing image retrievers struggle with processing long texts and handling unclear user expressions. To address these issues, we introduce the conversational query rewriting (CQR) task into the image retrieval domain and construct a dedicated multi-turn dialogue query rewriting dataset. Built on full dialogue histories, CQR rewrites users' final queries into concise, semantically complete ones that are better suited for retrieval. Specifically, we first leverage Large Language Models (LLMs) to generate rewritten candidates at scale and employ an LLM-as-Judge mechanism combined with manual review to curate approximately 7,000 high-quality multimodal dialogues, forming the ReCQR dataset. We then benchmark several SOTA multimodal models on the ReCQR dataset to assess their performance on image retrieval. Experimental results demonstrate that CQR not only significantly enhances the accuracy of traditional image retrieval models, but also provides new directions and insights for modeling user queries in multimodal systems.
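The generate-then-judge curation step described above might be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the names `Dialogue`, `curate_rewrite`, `toy_generate`, and `toy_judge`, the 0.8 threshold, and the pronoun heuristic are all assumptions made for this example. In a real setup the two callables would wrap LLM calls, and the surviving rewrites would still go to manual review, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Dialogue:
    history: List[str]  # earlier turns, oldest first
    final_query: str    # the user's last, possibly ambiguous, query


def curate_rewrite(
    dialogue: Dialogue,
    generate: Callable[[Dialogue], List[str]],
    judge: Callable[[Dialogue, str], float],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> Optional[str]:
    """Return the best-scoring rewrite, or None if no candidate passes the judge."""
    candidates = generate(dialogue)
    scored = [(judge(dialogue, c), c) for c in candidates]
    passing = [(score, c) for score, c in scored if score >= threshold]
    if not passing:
        return None
    return max(passing, key=lambda pair: pair[0])[1]


# Toy stand-ins for the two LLM calls (assumptions for illustration only).
def toy_generate(d: Dialogue) -> List[str]:
    return [d.final_query, "a photo of a red vintage car parked by the beach"]


def toy_judge(d: Dialogue, candidate: str) -> float:
    # Crude proxy for "semantically complete": penalize unresolved references.
    unresolved = {"it", "that", "one", "this"}
    words = set(candidate.lower().split())
    return 0.2 if words & unresolved else 0.9


dialogue = Dialogue(
    history=["I'm looking for pictures of vintage cars.", "Red ones look great."],
    final_query="Show me one parked by the beach",
)
result = curate_rewrite(dialogue, toy_generate, toy_judge)
print(result)
```

Passing the generator and judge in as callables keeps the filtering logic independent of any particular model or API, so the same function can be exercised with stubs like the ones above.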