Video-RAG
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Retrieval-augmented generation · first seen Nov 20, 2024
superseded — cited as a baseline and beaten by newer methods
2 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Video-RAG as a baseline.
“appending textual context such as ASR, OCR, or descriptions to the prompt”
— Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
“This contrasts with existing agent- or retrieval-augmented generation-based methods VideoAgent, Video-Agent, Video-RAG, DrVideo, which rely heavily on external tools for frame-level information extraction, limiting their capacity to respond to diverse queries due to the inherent constraints of these tools.”
— VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Beaten on benchmarks
Head-to-head results where a newer method reports beating Video-RAG. Values are copied from the source paper's tables — verify against the cited paper.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
  G2F-RAG beats Video-RAG (G2F-RAG score vs Video-RAG score):
  - MLVU [LLaVA-Video 7B] · 75.5 vs 71.3
  - WildVideo [LLaVA-Video 7B] · 57.0 vs 48.5
  - MLVU [Qwen2.5-VL 7B] · 73.4 vs 63.4
  - WildVideo [Qwen2.5-VL 7B] · 55.4 vs 47.2
  - VideoMME (w/o sub) [Qwen2.5-VL 7B] · 70.6 vs 60.5
- VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
  VideoStir beats Video-RAG (VideoStir score vs Video-RAG score):
  - Overall [LLaVA-Video (7B)] · 60.3 vs 58.7
  - Overall [mPLUG-Owl3 (8B)] · 55.1 vs 54.5
  - Overall [InternVL-1.5 (26B)] · 52.7 vs 52.2
  - Overall [LLaVA-Video (72B)] · 66.0 vs 65.4
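To see how large these gaps are, the sketch below computes the margin (newer method minus Video-RAG) for each result listed above. The scores are the ones listed on this page and still need checking against the cited papers. The names `RESULTS` and `margins` are illustrative assumptions, not part of any paper's code.

```python
from collections import defaultdict

# (newer method, benchmark, backbone, newer score, Video-RAG score)
# Values are copied from the head-to-head entries on this page, not re-verified.
RESULTS = [
    ("G2F-RAG", "MLVU", "LLaVA-Video 7B", 75.5, 71.3),
    ("G2F-RAG", "WildVideo", "LLaVA-Video 7B", 57.0, 48.5),
    ("G2F-RAG", "MLVU", "Qwen2.5-VL 7B", 73.4, 63.4),
    ("G2F-RAG", "WildVideo", "Qwen2.5-VL 7B", 55.4, 47.2),
    ("G2F-RAG", "VideoMME (w/o sub)", "Qwen2.5-VL 7B", 70.6, 60.5),
    ("VideoStir", "Overall", "LLaVA-Video (7B)", 60.3, 58.7),
    ("VideoStir", "Overall", "mPLUG-Owl3 (8B)", 55.1, 54.5),
    ("VideoStir", "Overall", "InternVL-1.5 (26B)", 52.7, 52.2),
    ("VideoStir", "Overall", "LLaVA-Video (72B)", 66.0, 65.4),
]


def margins(results):
    """Group margins (in points, rounded to one decimal) by newer method."""
    grouped = defaultdict(list)
    for method, _benchmark, _backbone, newer, video_rag in results:
        # Round to one decimal to hide binary floating-point noise.
        grouped[method].append(round(newer - video_rag, 1))
    return dict(grouped)


if __name__ == "__main__":
    for method, values in margins(RESULTS).items():
        print(f"{method}: smallest margin {min(values)}, largest margin {max(values)}")
```

On the listed values, the G2F-RAG margins range from 4.2 to 10.1 points, while the VideoStir margins range from 0.5 to 1.6 points.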
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Apr 7, 2026
- Graph-to-Frame RAG · Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning · Apr 6, 2026
- Apr 4, 2026
- AutoThinkRAG · AutothinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction · Mar 17, 2026
- Feb 27, 2026
- VimRAG · VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph · Feb 13, 2026
- Feb 5, 2026
- Feb 1, 2026
- Oct 8, 2025