CL | Oct 3, 2025

Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval

arXiv:2510.02938v1 | 4.91 citations | h-index: 13 | Has Code | EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for reliable evaluation standards in conversational data retrieval for product analysis, though it is incremental as it focuses on benchmarking rather than proposing new methods.

The authors address the problem of retrieving conversation data for product insights by creating the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for this task, with 1.6k queries across five analytical tasks and 9.1k conversations. Across 16 popular embedding models, even the best reach an NDCG@10 of only about 0.51, which points to a substantial gap between document retrieval and conversational data retrieval.

We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach only around NDCG@10 of 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at https://github.com/l-yohai/CDR-Benchmark.
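For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It assumes linear gain and a log2 rank discount, which is one common formulation; the abstract does not state which variant the benchmark uses. The function names `dcg_at_k` and `ndcg_at_k` and the relevance labels in the example are hypothetical and are not taken from the CDR-Benchmark code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved conversations, in ranked order.
    all_relevances: labels of every judged conversation for the query; if omitted,
        the ideal ranking is built from the retrieved list only (an assumption).
    """
    pool = all_relevances if all_relevances is not None else ranked_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items were judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical example: binary relevance labels for the top 10 retrieved conversations.
example_ranking = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(example_ranking, k=10), 3))
```

A benchmark-level score is typically the mean of this per-query value over all queries, so a reported 0.51 indicates that relevant conversations are often missing from the top 10 or ranked well below the top positions.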
