OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
OmniRAG-Agent addresses the cost and inefficiency of reasoning over long audio-video data in QA tasks; it is an incremental improvement with optimizations targeted at low-resource settings.
The paper tackles low-resource long audio-video question answering with OmniRAG-Agent, an agentic omnimodal method that combines retrieval-augmented generation with a tool-using agent loop, and it reports consistent gains over prior methods on OmniVideoBench, WorldSense, and Daily-Omni.
Long-horizon omnimodal question answering answers questions by reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings, with ablations validating each component.
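To make the group relative policy optimization step more concrete, the following is a minimal sketch of the group-relative advantage that the general technique computes: each sampled rollout's reward is normalized against the mean and standard deviation of the other rollouts sampled for the same question. The abstract does not specify the reward design or the exact normalization used by OmniRAG-Agent, so the function name `group_relative_advantages`, the `eps` term, and the example rewards are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per rollout sampled for the same question.
    `eps` (an assumption of this sketch) avoids division by zero when all
    rollouts in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same question, e.g. 1.0 for a
# correct final answer and 0.0 otherwise (the real reward is not given here).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to provide a baseline.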