CL | LG | Sep 14, 2025

CEMTM: Contextual Embedding-based Multimodal Topic Modeling

arXiv:2509.11465v2 | 3 citations | h-index: 61 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses multimodal topic modeling for researchers and practitioners, offering improved performance in tasks such as retrieval and semantic analysis. The advance is incremental because it builds on existing vision-language models.

The paper tackles the problem of inferring coherent topic structures from multimodal documents containing text and images. CEMTM outperforms the baselines, with an average LLM score of 2.61 across six benchmarks.

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
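The abstract names three ingredients: per-token contributions to topics, a distributional attention mechanism that weights those contributions, and a reconstruction objective that ties the topic representation back to the document embedding. The NumPy sketch below is only an illustration of how such pieces could fit together; it is not the authors' implementation. Everything in it is an assumption made for the example: the Gaussian form of the attention scores, the linear decoder, the mean-squared-error loss, and all names (`W_topic`, `w_mu`, `w_logvar`, `W_dec`). Random vectors stand in for the LVLM token and document embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def document_topics(token_emb, W_topic, w_mu, w_logvar, rng):
    """Aggregate token-level topic distributions into a document-topic vector.

    token_emb: (n_tokens, d) contextual embeddings (text tokens and image
               patches alike; here just random stand-ins).
    W_topic:   (d, K) hypothetical projection from embeddings to topic logits.
    w_mu, w_logvar: (d,) hypothetical parameters of a Gaussian over each
               token's attention score (an assumed form of "distributional"
               attention).
    """
    # Each token gets its own distribution over K topics.
    token_topics = softmax(token_emb @ W_topic, axis=-1)      # (n_tokens, K)

    # Sample one attention score per token from N(mu, sigma^2).
    mu = token_emb @ w_mu                                     # (n_tokens,)
    sigma = np.exp(0.5 * (token_emb @ w_logvar))              # (n_tokens,)
    scores = mu + sigma * rng.standard_normal(mu.shape)

    # Normalise scores over tokens so the weights sum to 1.
    weights = softmax(scores, axis=0)                         # (n_tokens,)

    # Document-topic distribution: weighted average of token-topic rows.
    theta = weights @ token_topics                            # (K,)
    return theta, weights, token_topics


def reconstruction_loss(theta, W_dec, doc_emb):
    """MSE between a linear decoding of theta and the document embedding."""
    recon = theta @ W_dec                                     # (d,)
    return float(np.mean((recon - doc_emb) ** 2))


rng = np.random.default_rng(0)
n_tokens, d, K = 12, 16, 4
token_emb = rng.standard_normal((n_tokens, d))
doc_emb = token_emb.mean(axis=0)          # stand-in for an LVLM document embedding
W_topic = rng.standard_normal((d, K)) * 0.1
w_mu = rng.standard_normal(d) * 0.1
w_logvar = rng.standard_normal(d) * 0.1
W_dec = rng.standard_normal((K, d)) * 0.1

theta, weights, token_topics = document_topics(token_emb, W_topic, w_mu, w_logvar, rng)
loss = reconstruction_loss(theta, W_dec, doc_emb)
print("document-topic distribution:", np.round(theta, 3))
print("reconstruction loss:", loss)
```

Because `theta` is a convex combination of per-token topic distributions, it remains a valid probability vector, which is one simple way to keep the explicit document-topic distribution the abstract mentions. In a trained model the parameters would be learned by minimising the reconstruction loss rather than drawn at random.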

