HLTCOE Evaluation Team at TREC 2025: VQA Track
This work addresses the challenge of generating coherent, fine-grained answer lists in video question answering. It represents an incremental improvement, achieved through a hybrid method that combines generative candidate production with discriminative reranking.
The team set out to improve semantic precision and ranking consistency in video question answering. They developed a listwise learning framework that reranks candidate answers using a novel loss function. The reported result is consistent gains in accuracy and ranking stability, particularly for questions requiring temporal reasoning and semantic disambiguation.
The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
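The abstract names the three ingredients of the objective but not its exact form, so the following is only a minimal sketch of how such a loss could be assembled. It assumes that at each rank position the reranker emits one score per vocabulary entry, that the mask restricts the softmax to the candidate identifiers, and that earlier ranks receive larger weights. The function names, the logarithmic rank discount, and the normalization by the weight sum are illustrative assumptions, not the formulation used in the submission.

```python
import math

def rank_weights(n):
    # Assumed DCG-style discount: rank 1 gets weight 1.0, later ranks get less.
    return [1.0 / math.log2(k + 2) for k in range(n)]

def masked_log_softmax(logits, allowed):
    # Log-softmax restricted to the allowed vocabulary indices;
    # every other index is masked out with -inf.
    m = max(logits[i] for i in allowed)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in allowed))
    return [logits[i] - log_z if i in allowed else float("-inf")
            for i in range(len(logits))]

def masked_pointer_ce(step_logits, target_ids, candidate_ids):
    # step_logits[k]: scores over the vocabulary at rank position k.
    # target_ids[k]: identifier of the candidate that belongs at rank k.
    # candidate_ids: identifiers the pointer is allowed to select.
    allowed = set(candidate_ids)
    if any(t not in allowed for t in target_ids):
        raise ValueError("every target must be one of the candidates")
    weights = rank_weights(len(target_ids))
    total = 0.0
    for logits, target, w in zip(step_logits, target_ids, weights):
        log_probs = masked_log_softmax(logits, allowed)
        total += -w * log_probs[target]  # rank-weighted cross-entropy term
    return total / sum(weights)

# Two candidates (ids 1 and 3) in a vocabulary of four entries.
confident = [[0.0, 10.0, 0.0, -5.0], [0.0, -5.0, 0.0, 10.0]]
loss_correct = masked_pointer_ce(confident, [1, 3], [1, 3])
loss_swapped = masked_pointer_ce(confident, [3, 1], [1, 3])
```

In this sketch the correct ordering yields a loss close to zero, while the swapped ordering is penalized heavily, with the error at the first rank weighted most. Scores assigned to identifiers outside the candidate set (indices 0 and 2 here) have no effect on the loss, which is the role the vocabulary restriction plays.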