CV · Nov 19, 2025

RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

arXiv:2511.15923v1 · 1 citation · h-index: 13
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of adapting vision language models to domain-specific video analysis with limited data, offering an incremental improvement in annotation efficiency.

It tackles domain-specific video classification with limited data by proposing a two-stage fine-tuning method that uses self-generated rationales to bridge the rationale gap. The authors report that the method significantly outperforms direct supervised fine-tuning on diverse datasets.

Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical "rationale gap", where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.
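
The two-stage recipe described in the abstract can be sketched as a data-preparation pipeline. This is a minimal illustration and not the authors' implementation: the names `Video`, `SFTExample`, `build_stage1_examples`, `build_stage2_examples`, and `run_rb_ft`, as well as both prompt strings, are hypothetical. The model is represented only by two caller-supplied callables (`generate_rationale` and `fine_tune`), and the sketch assumes the rationale prompt does not include the label, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Video:
    path: str   # location of the video clip
    label: str  # ground-truth class from the small domain dataset


@dataclass(frozen=True)
class SFTExample:
    video_path: str
    prompt: str
    target: str  # text the model is trained to produce


# Hypothetical prompts; the abstract does not give the actual wording.
RATIONALE_PROMPT = (
    "Describe in detail the visual and temporal evidence in this video "
    "that is relevant to classifying it."
)
LABEL_PROMPT = "Classify this video. Options: {options}"


def build_stage1_examples(
    videos: Sequence[Video],
    generate_rationale: Callable[[str, str], str],
) -> List[SFTExample]:
    """Stage 1 data: the model's own rationales become training targets."""
    return [
        SFTExample(v.path, RATIONALE_PROMPT,
                   generate_rationale(v.path, RATIONALE_PROMPT))
        for v in videos
    ]


def build_stage2_examples(
    videos: Sequence[Video],
    class_names: Sequence[str],
) -> List[SFTExample]:
    """Stage 2 data: conventional SFT pairs with the task label as target."""
    prompt = LABEL_PROMPT.format(options=", ".join(class_names))
    return [SFTExample(v.path, prompt, v.label) for v in videos]


def run_rb_ft(
    videos: Sequence[Video],
    class_names: Sequence[str],
    generate_rationale: Callable[[str, str], str],
    fine_tune: Callable[[List[SFTExample]], None],
) -> None:
    """Run both stages in order; no new annotations are required."""
    # Rationales are generated before any fine-tuning takes place.
    stage1 = build_stage1_examples(videos, generate_rationale)
    fine_tune(stage1)  # stage 1: fine-tune on self-generated rationales
    fine_tune(build_stage2_examples(videos, class_names))  # stage 2: label SFT
```

In practice `generate_rationale` would wrap inference with the VLM being adapted and `fine_tune` would wrap whatever training loop is in use. Keeping both as plain callables lets the ordering of the two stages be checked without a model.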

