CEMTM: Contextual Embedding-based Multimodal Topic Modeling
CEMTM addresses multimodal topic modeling for researchers and practitioners, improving performance on tasks such as retrieval and semantic analysis. The contribution is incremental in that it builds on existing vision-language models.
The paper addresses the problem of inferring coherent topic structures from multimodal documents containing text and images. CEMTM outperforms unimodal and multimodal baselines, with an average LLM score of 2.61 across six benchmarks.
We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving an average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
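The attention-weighted topic inference and the reconstruction objective described above can be illustrated with a minimal pure-Python sketch. The abstract does not give the equations, so everything here is an assumption made for illustration and not the authors' implementation: the Gaussian form of the per-token importance score, the squared-error reconstruction loss, and all names (`document_topics`, `reconstruction_loss`, `w_topic`, `w_mu`, `w_logvar`, `topic_embs`). Random vectors stand in for the LVLM token and document embeddings.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x d) matrix by a length-d vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def document_topics(token_embs, w_topic, w_mu, w_logvar, rng):
    """Return (theta, alpha): document-topic distribution and token weights."""
    # Per-token topic distribution: softmax over K topic logits.
    token_topics = [softmax(matvec(w_topic, h)) for h in token_embs]
    # Assumed "distributional" attention: each token gets a Gaussian
    # importance score, sampled with the reparameterization trick.
    scores = []
    for h in token_embs:
        mu = sum(w * x for w, x in zip(w_mu, h))
        logvar = sum(w * x for w, x in zip(w_logvar, h))
        scores.append(mu + math.exp(0.5 * logvar) * rng.gauss(0.0, 1.0))
    alpha = softmax(scores)  # token weights, sum to 1
    k = len(w_topic)
    # Document-topic distribution: attention-weighted mix of token topics.
    theta = [sum(a * t[j] for a, t in zip(alpha, token_topics)) for j in range(k)]
    return theta, alpha


def reconstruction_loss(theta, topic_embs, doc_emb):
    """Squared error between the topic-based reconstruction and doc_emb."""
    d = len(doc_emb)
    recon = [
        sum(theta[j] * topic_embs[j][i] for j in range(len(theta)))
        for i in range(d)
    ]
    return sum((r - e) ** 2 for r, e in zip(recon, doc_emb))


# Toy usage: 5 tokens (text or image patches), embedding size 4, 3 topics.
rng = random.Random(0)
d, k, n = 4, 3, 5
token_embs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
w_topic = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
w_mu = [rng.uniform(-1, 1) for _ in range(d)]
w_logvar = [rng.uniform(-1, 1) for _ in range(d)]
topic_embs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
# Hypothetical stand-in for the LVLM document embedding: mean of token embeddings.
doc_emb = [sum(h[i] for h in token_embs) / n for i in range(d)]

theta, alpha = document_topics(token_embs, w_topic, w_mu, w_logvar, rng)
loss = reconstruction_loss(theta, topic_embs, doc_emb)
```

Because `theta` is a convex combination of per-token topic distributions, it remains a valid probability distribution over topics, which is what keeps the document-topic output directly interpretable in this sketch. In a trained model the weight matrices would be learned by minimizing the reconstruction loss rather than drawn at random.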