CL-VISTA: Benchmarking Continual Learning in Video Large Language Models
This work addresses the problem of evaluating continual learning in multimodal foundation models. Its contribution is incremental in that it offers a benchmark rather than a new learning method.
The authors address the lack of suitable benchmarks for evaluating continual learning in Video Large Language Models by proposing CL-VISTA, a benchmark of 8 diverse tasks that induces substantial distribution shifts and exposes catastrophic forgetting. Extensive benchmarking of 10 methods reveals a fundamental trade-off: no single approach excels across all dimensions.
Video Large Language Models (Video-LLMs) require continual learning (CL) to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to determine whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.
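The abstract does not spell out its 6 protocols, so the following is only a minimal sketch of how catastrophic forgetting is commonly quantified in the continual learning literature, not CL-VISTA's own evaluation code. It assumes a hypothetical accuracy matrix `acc`, where `acc[i][j]` is the accuracy on task `j` measured after training on tasks `0..i`. The function names and the toy numbers are illustrative assumptions.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after training on the final task."""
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from each earlier task's best past accuracy to its final accuracy.

    The last task is excluded because nothing is trained after it.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j at any step from when it was learned
        # up to (but not including) the final step.
        best_past = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_past - acc[-1][j])
    return sum(drops) / len(drops)


# Toy, made-up numbers for 3 sequential tasks (not results from the paper).
# Entries above the diagonal are placeholders for tasks not yet learned.
acc = [
    [80.0, 0.0, 0.0],
    [70.0, 90.0, 0.0],
    [60.0, 75.0, 85.0],
]

print(average_accuracy(acc))
print(average_forgetting(acc))
```

Under these assumptions, a benchmark whose tasks are highly redundant would yield forgetting values near zero, while one with substantial distribution shift between tasks would produce a clearly positive value.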