Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching
This work addresses the challenge of precise temporal grounding in sports coaching videos, where frame-level supervision is expensive to collect from humans and unreliable when produced by other models. It offers an incremental improvement over existing methods.
The paper tackles the problem of video-LLMs attending to irrelevant frames in sports coaching tasks by proposing a self-consistency objective that enforces temporal grounding between related tasks without additional annotations, resulting in gains of up to +14.1% accuracy and +0.9 BERTScore over supervised finetuning.
Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet frame-level supervision is hard to obtain: it is expensive to collect from humans and unreliable when produced by other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0% and +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks (Exact, FitnessQA, and ExpertAF), even surpassing closed-source models.
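To make the idea of a self-consistency objective over attention maps concrete, here is a minimal sketch. It is not the paper's loss, whose exact form the abstract does not give. It assumes that each task's visual attention has already been pooled into one non-negative weight per frame, and it uses the Jensen-Shannon divergence as a stand-in consistency measure. The names `normalize`, `kl`, and `consistency_loss`, and the toy attention values, are hypothetical.

```python
import math

def normalize(weights, eps=1e-8):
    """Turn non-negative per-frame attention mass into a probability distribution."""
    total = sum(weights)
    return [(w + eps) / (total + eps * len(weights)) for w in weights]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def consistency_loss(attn_a, attn_b):
    """Jensen-Shannon divergence between two per-frame attention distributions.

    Assumption: this symmetric divergence stands in for the paper's
    self-consistency objective, which is not specified in the abstract.
    """
    p, q = normalize(attn_a), normalize(attn_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy per-frame attention for a generation task and a verification task
# over the same four-frame clip (values are illustrative, not measured).
gen_attn = [0.05, 0.10, 0.70, 0.15]
ver_same = [0.05, 0.10, 0.70, 0.15]  # attends to the same frames
ver_off = [0.70, 0.15, 0.05, 0.10]   # attends to different frames

loss_same = consistency_loss(gen_attn, ver_same)
loss_off = consistency_loss(gen_attn, ver_off)
```

In this sketch the loss vanishes when both tasks place their attention on the same frames and grows as they diverge, so adding it to the supervised finetuning loss would push related tasks toward a shared temporal grounding without requiring keyframe labels.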