CLJun 2

CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA

arXiv:2512.0036056.1
Predicted impact top 98% in CL · last 90 daysOriginality Incremental advance
AI Analysis

For researchers and practitioners building real-time educational video QA systems, this work provides a new benchmark and a practical method that balances retrieval accuracy with strict latency constraints.

The paper introduces CourseTimeQA, a benchmark for timestamped QA over lecture videos, and proposes CrossFusion-RAG, a latency-constrained cross-modal retriever that improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 baseline while achieving 1.55 s median latency on a single A100.

We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating; text-only hybrid with cross-encoder reranking and its MMR variant; caption-augmented text retrieval; non-learned temporal smoothing) are evaluated under matched hardware and indexing. We report robustness across ASR noise (WER quartiles), diagnostics for temporal localization, and full training/tuning details to support reproducible comparison.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes