CV, CL, MM | Sep 30, 2025

FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos

arXiv:2509.25745v1 | 2 citations | h-index: 29
Originality: Synthesis-oriented
AI Analysis

This work addresses automated captioning for financial short-form videos. It establishes the first baselines for the task but is incremental in nature.

The study evaluated multimodal large language models for generating topic-aligned captions on 624 financial short-form YouTube videos. Video alone performed strongly on four of five topics, and selective modality pairs often outperformed the full three-modality combination, which suggests that adding modalities can introduce noise.

We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of five topics, underscoring its value for capturing visual context and effective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that too many modalities may introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate the potential and challenges of grounding complex visual cues in this domain. All code and data can be found on our GitHub under the CC-BY-NC-SA 4.0 license.
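
The seven modality combinations named in the abstract are the non-empty subsets of {T, A, V}, and crossing them with the five topics gives the evaluation grid. The sketch below only illustrates that enumeration; the function names `modality_combinations` and `evaluation_grid` are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations

# Modalities and topics as listed in the abstract.
MODALITIES = ("T", "A", "V")  # transcript, audio, video
TOPICS = (
    "main recommendation",
    "sentiment analysis",
    "video purpose",
    "visual analysis",
    "financial entity recognition",
)


def modality_combinations(modalities=MODALITIES):
    """Return every non-empty subset of the modalities as a label such as 'TV'."""
    return [
        "".join(subset)
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


def evaluation_grid():
    """Pair each modality combination with each topic (hypothetical helper)."""
    return [(combo, topic) for combo in modality_combinations() for topic in TOPICS]


combos = modality_combinations()
print(combos)
print(len(evaluation_grid()))
```

With three modalities there are 2^3 - 1 = 7 non-empty subsets, so the grid has 7 x 5 = 35 combination-topic cells.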

