CL, LG | Sep 14, 2025

CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

arXiv:2509.11465v28.33 citations | h-index: 61 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses multimodal topic modeling for researchers and practitioners and reports improved performance on tasks such as retrieval and semantic analysis. The advance is incremental, since it builds on existing vision-language models.

The paper tackles the problem of inferring coherent topic structures from multimodal documents containing text and images. CEMTM outperforms the baselines, reaching an average LLM score of 2.61 across six benchmarks.

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
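The abstract describes the pipeline only at a high level, so the following is a minimal NumPy sketch of one way the pieces could fit together, not the authors' implementation. It assumes that "distributional attention" means sampling a per-token importance score from a Gaussian and normalizing the scores with a softmax, and that the reconstruction objective is a mean squared error between a topic-based vector and the document embedding. All names (`topic_sketch`, `W_topic`, `w_mu`, `w_sigma`, `topic_emb`) are hypothetical, and the random arrays stand in for LVLM embeddings and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topic_sketch(token_emb, doc_emb, n_topics=4, seed=0):
    """Illustrative forward pass; every parameter here is an assumption."""
    rng = np.random.default_rng(seed)
    n_tokens, dim = token_emb.shape

    # Hypothetical parameters, randomly initialised (they would be learned).
    W_topic = rng.normal(scale=0.1, size=(dim, n_topics))
    w_mu = rng.normal(scale=0.1, size=dim)
    w_sigma = rng.normal(scale=0.1, size=dim)
    topic_emb = rng.normal(scale=0.1, size=(n_topics, dim))

    # Per-token topic distribution, shape (n_tokens, n_topics).
    token_topics = softmax(token_emb @ W_topic, axis=1)

    # Assumed "distributional" attention: sample one importance score per
    # token from N(mu, sigma^2), then normalise the scores over tokens.
    mu = token_emb @ w_mu
    sigma = np.log1p(np.exp(token_emb @ w_sigma))  # softplus keeps sigma > 0
    scores = mu + sigma * rng.standard_normal(n_tokens)
    weights = softmax(scores, axis=0)

    # Document-topic distribution: importance-weighted average over tokens.
    theta = weights @ token_topics

    # Assumed reconstruction objective: map theta back to embedding space
    # and compare it with the document embedding (mean squared error).
    recon = theta @ topic_emb
    loss = float(np.mean((recon - doc_emb) ** 2))
    return theta, weights, loss

# Toy usage: random vectors stand in for text and image token embeddings.
demo_rng = np.random.default_rng(1)
tokens = demo_rng.normal(size=(6, 8))
doc = tokens.mean(axis=0)
theta, weights, loss = topic_sketch(tokens, doc)
```

Because `theta` is a convex combination of per-token topic distributions, it is itself a valid distribution over topics, which is the kind of explicit document-topic output the abstract refers to.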
