Superseded baseline · #337 of 1,179 most-superseded

Video-LLaMA

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Retrieval-augmented generation · first seen Jun 5, 2023

superseded — cited as a baseline and beaten by newer methods

0 papers critique it · 1 paper beats it on benchmarks

Beaten on benchmarks

Head-to-head results where a newer method reports beating Video-LLaMA. Values are copied from the source paper's tables — verify against the cited paper.

AffectAgent beats Video-LLaMA · Mean [all MLLM backbones - Video-LLaMA]
38.75 vs 32.37
AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.

AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
Apr 14, 2026