AI | CL | Jun 2, 2025

Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents

arXiv:2506.01689v1 | 2 citations | h-index: 8
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge multimodal AI faces with complex user intents that call for video demonstrations, though the contribution is incremental: it focuses on benchmarking rather than model improvement.

The authors evaluate how well text-to-video models answer real-world user queries that require visual responses. They construct the RealVideoQuest benchmark, with 4.5K query-video pairs, and find that current models struggle with these tasks.

Querying generative AI models, e.g., large language models (LLMs), has become a prevalent method for information acquisition. However, existing query-answer datasets primarily focus on textual responses, making it challenging to address complex user queries that require visual demonstrations or explanations for better understanding. To bridge this gap, we construct a benchmark, RealVideoQuest, designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents from Chatbot-Arena and builds 4.5K high-quality query-video pairs through a multistage video retrieval and refinement process. We further develop a multi-angle evaluation system to assess the quality of generated video answers. Experiments indicate that current T2V models struggle with effectively addressing real user queries, pointing to key challenges and future research opportunities in multimodal AI.

