CL · CV · Jun 3

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

arXiv:2606.04591 · 97.6

Predicted impact: top 4% in CL · last 90 days
Originality: Incremental advance

AI Analysis

This work targets fine-grained retrieval of coherent fragments from multi-modal dialogues, a practical need for users of communication platforms. Methodologically, it is an incremental extension of existing retrieval and reinforcement learning methods.

The paper introduces Fine-grained Fragment Retrieval (FFR) for retrieving coherent multi-utterance, multi-image fragments from multi-modal long-form dialogues, proposing F2RVLM (a generation-based retrieval model with reinforcement learning) for single-dialogue and FFRS (a two-stage system with fragment indexing) for corpus-level retrieval. Experiments on the new MLDR dataset and a WeChat test set show superior performance over baselines.

With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
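The two-stage flow described for FFRS (offline fragment indexing, then online Top-K recall followed by fine-grained selection) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the bag-of-words `embed` function is a toy stand-in for FEM, the fixed-size split in `build_index` replaces the paper's minimal semantic fragments, the word-overlap `refine` step replaces F2RVLM's reasoning, and images are ignored entirely. It shows only the shape of the pipeline, not the authors' implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a fragment embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(dialogues, fragment_size=2):
    """Offline stage: split each dialogue into fragments and embed them.

    Fixed-size splitting is an assumption; the source describes
    semantic fragments instead.
    """
    index = []
    for dialogue_id, utterances in dialogues.items():
        for start in range(0, len(utterances), fragment_size):
            fragment = utterances[start:start + fragment_size]
            index.append((dialogue_id, start, fragment, embed(" ".join(fragment))))
    return index


def recall_top_k(index, query, k=2):
    """Online stage one: rank all fragments by similarity, keep the top k."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[3]), reverse=True)
    return ranked[:k]


def refine(candidates, query):
    """Online stage two (hypothetical stand-in for the fine-grained model):
    keep only utterances that share at least one word with the query."""
    query_terms = set(query.lower().split())
    selected = []
    for dialogue_id, start, fragment, _ in candidates:
        for offset, utterance in enumerate(fragment):
            if query_terms & set(utterance.lower().split()):
                selected.append((dialogue_id, start + offset, utterance))
    return selected


if __name__ == "__main__":
    dialogues = {
        "d1": [
            "where should we eat tonight",
            "the ramen place near the station",
            "did you finish the report",
            "yes I sent the report yesterday",
        ],
        "d2": ["look at this photo of my cat", "so cute"],
    }
    index = build_index(dialogues)
    candidates = recall_top_k(index, "ramen place", k=1)
    print(refine(candidates, "ramen place"))
```

Splitting recall from refinement keeps the expensive step off the full corpus: the cheap similarity search narrows the candidates, and only those are passed to the costlier second stage.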
