CV · Mar 12, 2025

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

arXiv:2503.09279v1 · 2 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses limitations in video captioning for vision-language applications, but it is incremental as it builds on existing methods with a novel training approach.

The paper tackles biased capability and misalignment with human preferences in Video Detailed Captioning (VDC) by proposing Cockatiel, a three-stage training pipeline that ensembles synthetic and human-aligned training. The authors report new state-of-the-art performance on VDCSCORE and a large margin over alternatives in human preference evaluation.

Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capability towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset and use it to select synthetic captions that perform well on specific fine-grained video-caption alignment aspects and are preferred by humans, while discarding the rest. Then, we train Cockatiel-13B on this curated dataset to infuse it with the assembled model strengths and human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.
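
The first stage described in the abstract, scorer-based selection of synthetic captions, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` record, the `select_captions` helper, the `min_score` threshold, and the rule of keeping one best caption per video are all assumptions made for illustration, and the scores are made-up numbers standing in for the output of a learned scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One synthetic caption for one video (hypothetical record layout)."""
    video_id: str
    source_model: str  # hypothetical name of the captioner that produced it
    caption: str
    score: float       # stand-in for a learned scorer's output; higher is better


def select_captions(candidates, min_score=0.5):
    """Keep, per video, the highest-scoring candidate that reaches min_score.

    Videos whose candidates all fall below the threshold are dropped,
    mirroring the idea of selecting some captions and discarding others.
    """
    best = {}
    for cand in candidates:
        if cand.score < min_score:
            continue  # discard low-scoring captions
        current = best.get(cand.video_id)
        if current is None or cand.score > current.score:
            best[cand.video_id] = cand
    return best


# Illustrative data only; the scores are invented for this example.
candidates = [
    Candidate("vid_001", "model_a", "A man walks a dog.", 0.42),
    Candidate("vid_001", "model_b",
              "A man in a red coat walks a small dog along a snowy path.", 0.91),
    Candidate("vid_002", "model_a", "Something happens.", 0.10),
]
selected = select_captions(candidates)
```

In this sketch the curated set would then serve as training data for the captioner; how the paper actually combines scores across alignment dimensions and human preference is not specified in the abstract and is not modeled here.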

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
