CV | AI | Apr 30, 2025

Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization

arXiv:2504.21831v2 | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficient video summarization for resource-constrained applications, presenting an incremental improvement over existing methods.

The paper tackles video summarization by proposing DEEVISum, a lightweight vision-language model that uses multi-stage knowledge distillation and early exit. With multi-stage distillation it reaches a 61.1 F1 score on the TVSum dataset, and early exit cuts inference time by about 21% at the cost of a 1.3-point drop in F1.

We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision-language model designed for segment-wise video summarization. Leveraging multi-modal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model, PaLI Gemma2 3B + MSKD, achieves an F1 score of 61.1, rivaling the performance of significantly larger models while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
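The abstract does not spell out how the early exit decision is made, so the following is only a minimal sketch of the general idea, assuming a confidence-threshold criterion: an exit head after each layer produces class probabilities, and inference stops at the first head whose top probability reaches a threshold. The function name `early_exit_forward`, the per-layer heads, and the `threshold` parameter are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_heads: Sequence[Callable[[Vector], Vector]],
    threshold: float = 0.9,  # assumed confidence criterion, not from the paper
) -> Tuple[Vector, int]:
    """Run layers in order and stop at the first exit head that is confident.

    Returns the predicted probabilities and the number of layers executed.
    """
    hidden = x
    probs: Vector = []
    depth = 0
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        hidden = layer(hidden)
        probs = softmax(head(hidden))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs, depth


# Toy usage: each "layer" doubles the hidden values, each "head" is the identity.
toy_layers = [lambda h: [2 * v for v in h]] * 4
toy_heads = [lambda h: list(h)] * 4
toy_probs, toy_depth = early_exit_forward([0.5, 0.0], toy_layers, toy_heads, threshold=0.9)
print(f"exited after {toy_depth} of {len(toy_layers)} layers")
```

In a sketch like this, lowering the threshold lets more inputs leave at shallow layers, which saves compute but risks less accurate predictions; this is the same kind of speed-versus-F1 trade-off the abstract reports for EE.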

