When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation

Thibault Prouteau, Francis Lareau, Nicolas Dugué, Jean-Charles Lamirel, Christophe Malaterre

arXiv:2603.01945v10.6h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the challenge of aligning human and automated evaluations for topic models, particularly in domain-specific applications, though it is incremental in extending existing evaluation frameworks.

The paper tackles the problem of evaluating topic models in specialized domains by introducing Topic Word Mixing (TWM), a novel human evaluation task that assesses inter-topic distinctness. Based on nearly 4,000 annotations from a philosophy of science corpus, it finds that automated metrics such as coherence and diversity do not always align with human judgment.

Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task that assesses inter-topic distinctness by testing whether annotators can tell whether a word set comes from a single topic or from a mix of topics. TWM complements word intrusion's focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models, both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb), comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
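To make the two notions of distinctness concrete, the following is a minimal sketch, not the authors' released task generation code. It assumes each topic is a ranked list of top words. `topic_diversity` computes a common automated diversity score, the share of unique words among the top words of all topics. `make_twm_item` is a hypothetical helper that builds one TWM-style item by drawing words from a single topic or from two topics. The function names, the even split between two topics, the item size, and the toy word lists are all assumptions made for illustration.

```python
import random

# Toy topics (invented for illustration): each is a ranked list of top words.
topics = [
    ["theory", "explanation", "model", "law", "causation", "evidence"],
    ["species", "selection", "evolution", "gene", "fitness", "organism"],
    ["quantum", "measurement", "particle", "model", "symmetry", "spacetime"],
]


def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of all topics.

    1.0 means no word is shared between topics; lower values mean more overlap.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


def make_twm_item(topics, n_words=6, mixed=False, rng=None):
    """Build one hypothetical TWM-style item (assumed design, not the paper's).

    If mixed is False, all words come from one randomly chosen topic.
    If mixed is True, the words are split as evenly as possible between two topics.
    """
    rng = rng or random.Random()
    if mixed:
        a, b = rng.sample(range(len(topics)), 2)
        half = n_words // 2
        first = rng.sample(topics[a], half)
        # Skip words already drawn so the item contains no duplicates.
        pool_b = [word for word in topics[b] if word not in first]
        words = first + rng.sample(pool_b, n_words - half)
        source = (a, b)
    else:
        a = rng.randrange(len(topics))
        words = rng.sample(topics[a], n_words)
        source = (a,)
    rng.shuffle(words)  # hide which topic each word came from
    return {"words": words, "mixed": mixed, "source_topics": source}


diversity = topic_diversity(topics, top_k=6)
single_item = make_twm_item(topics, n_words=4, mixed=False, rng=random.Random(0))
mixed_item = make_twm_item(topics, n_words=4, mixed=True, rng=random.Random(0))
```

In this toy setup, only "model" appears in two topics, so the diversity score falls slightly below 1. An annotator shown `mixed_item["words"]` would be asked whether the words come from one topic or several, and the answer can be checked against the stored `mixed` flag.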
