CL · Oct 27, 2025

IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering

Jieyong Kim, Maryam Amirizaniani, Soojin Yoon, Dongha Lee

arXiv:2510.23536v1 · 2 citations · h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses a critical gap in personalized question answering: it provides a benchmark that measures intent identification directly, a capability that is essential for generating responses that satisfy individual information needs.

The paper tackles the evaluation of intent identification in personalized question answering by introducing IPQA, a benchmark for core intent identification. It finds that current language models struggle with this task and that performance degrades as question complexity increases.

Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because, without understanding which intents users prioritize, systems cannot generate responses that satisfy individual information needs. To address this, we introduce the concept of core intents: the intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory, under which users choose answers that meet their acceptance thresholds. We construct a dataset spanning various domains through systematic filtering, LLM-based annotation, and rigorous quality control that combines automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. Models fail to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.
