LG · May 28

MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion

arXiv:2605.29765
57.0
Predicted impact: top 41% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For researchers working on topic modeling in multimodal long-form video, MMTM provides a deterministic fusion method that improves topic coherence and temporal stability on broadcast news, though the lexical-coherence (NPMI) gain is corpus-dependent.

MMTM introduces a modular pipeline for topic discovery in long-form video that fuses speech, audio, and visual embeddings via similarity-gated fusion. On German and English broadcast news, it reduces noise from 0.27 to 0.06, transition rate from 0.70 to 0.21, and increases normalized entropy from 0.84 to 0.92, indicating more coherent and stable topics.

We introduce MMTM, a modular pipeline for topic discovery in long-form video that integrates speech recognition, audio and visual embeddings, and BERTopic clustering through a deterministic similarity-gated fusion. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality: noise drops from 0.27 to 0.06, transition rate from 0.70 to 0.21, and normalized entropy rises from 0.84 to 0.92, indicating more coherent and temporally stable topics. Cluster validity (Calinski-Harabasz) improves by a factor of 5 to 12 across embedding spaces. Lexical coherence (NPMI) rises from 0.77 to 0.86 on German but is corpus-dependent and does not transfer to the shorter NBC broadcasts. We release the pipeline code and a human-validated 54-hour multimodal video topic corpus with dual-annotator visual evaluation and LLM-assisted labeling.
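The abstract reports noise, transition rate, and normalized entropy without defining them. As a rough illustration only, the sketch below shows one plausible way to compute such quantities from a time-ordered sequence of per-segment topic labels. The definitions are assumptions, not the paper's: noise is taken as the share of segments labeled -1 (the outlier label BERTopic uses), transition rate as the share of adjacent segment pairs whose label changes, and normalized entropy as the Shannon entropy of the topic distribution divided by the log of the number of topics. The function names are hypothetical.

```python
import math
from collections import Counter


def noise_fraction(labels, outlier=-1):
    """Share of segments assigned to the outlier label (assumed definition)."""
    return sum(1 for label in labels if label == outlier) / len(labels)


def transition_rate(labels):
    """Share of adjacent segment pairs whose topic label differs (assumed definition)."""
    if len(labels) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes / (len(labels) - 1)


def normalized_entropy(labels, outlier=-1):
    """Shannon entropy of non-outlier topic shares, divided by log(K) (assumed definition)."""
    counts = Counter(label for label in labels if label != outlier)
    k = len(counts)
    if k < 2:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical label sequence for eight consecutive video segments.
example_labels = [0, 0, 1, 1, -1, 2, 2, 2]
noise = noise_fraction(example_labels)
rate = transition_rate(example_labels)
entropy = normalized_entropy(example_labels)
```

Under these assumed definitions, lower noise and a lower transition rate mean fewer unassigned segments and fewer topic switches between neighboring segments, while a normalized entropy closer to 1 means segments are spread more evenly across topics.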
