Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives
This work addresses the challenge of capturing cultural nuances in topic modeling for underrepresented languages such as Belgian Dutch. Its contribution is incremental, since it applies existing methods to a new dataset.
The study evaluated topic modeling methods on nearly 25,000 Belgian Dutch daily narratives. BERTopic identified the most coherent and culturally relevant topics in human evaluation, while LDA scored well on automated coherence metrics but less well with human raters.
Standard topic models often struggle to capture culturally specific nuances in text. This study evaluates the effectiveness of contextual embeddings for identifying culturally resonant themes in an underrepresented linguistic context. We compare the performance of K-Means clustering, Latent Dirichlet Allocation (LDA), and BERTopic on a corpus of nearly 25,000 daily personal narratives written in Belgian Dutch (Flemish). While LDA achieves strong performance on automated coherence metrics, subsequent human evaluation reveals that BERTopic consistently identifies the most coherent and culturally relevant topics, highlighting the limitations of purely statistical methods on this narrative-rich data. Furthermore, K-Means performs worse here than in prior work on similar Dutch corpora, which underscores the distinct linguistic challenges posed by personal narratives. Our findings demonstrate the critical role of contextual embeddings in robust topic modeling and emphasize the need for human-centered evaluation, particularly when working with low-resource languages and culturally specific domains.
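To make "automated coherence metrics" concrete, the sketch below computes one common choice, NPMI coherence, from document-level co-occurrence of a topic's top words. The abstract does not name the metrics actually used, so NPMI is an assumption here, and the function name `npmi_coherence` and the toy Dutch tokens are hypothetical illustrations rather than material from the study.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    if n_docs == 0 or len(top_words) < 2:
        raise ValueError("need at least one document and two top words")

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = doc_freq(w1) / n_docs
        p2 = doc_freq(w2) / n_docs
        p12 = doc_freq(w1, w2) / n_docs
        if p12 == 0.0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1.0:
            scores.append(0.0)  # assumption: treat the undefined 0/0 case as neutral
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy corpus of tokenized narratives (not data from the study).
docs = [
    ["werk", "collega", "vergadering"],
    ["werk", "collega", "koffie"],
    ["fiets", "regen", "school"],
    ["fiets", "regen", "werk"],
]

coherent = npmi_coherence(["werk", "collega"], docs)
incoherent = npmi_coherence(["collega", "regen"], docs)
print(f"coherent topic: {coherent:.3f}, incoherent topic: {incoherent:.3f}")
```

A score like this rewards words that co-occur in the same documents, which is a purely distributional notion of coherence. That is one plausible reason a model can score well on such a metric and still be rated lower by human readers judging cultural relevance.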