IR CL

May 30, 2025

Is BERTopic Better than PLSA for Extracting Key Topics in Aviation Safety Reports?

Aziida Nanyonga, Keith Joiner, Ugur Turhan, Graham Wild

arXiv:2506.06328v1

3.61 citations

h-index: 10

2025 3rd International Conference on Artificial Intelligence and Machine Learning Applications Theme: Healthcare and Internet of Things (AIMLA)

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of analyzing complex aviation incident data for safety experts, but its contribution is incremental: it compares existing methods on a specific dataset.

This study compared BERTopic and PLSA for extracting topics from aviation safety reports, finding that BERTopic achieved higher topic coherence (Cv score 0.41 vs. 0.37) and better interpretability.

This study compares the effectiveness of BERTopic and Probabilistic Latent Semantic Analysis (PLSA) in extracting meaningful topics from aviation safety reports, with the aim of improving the understanding of patterns in aviation incident data. Using a dataset of over 36,000 National Transportation Safety Board (NTSB) reports from 2000 to 2020, BERTopic employed transformer-based embeddings and hierarchical clustering, while PLSA used probabilistic modelling fitted with the Expectation-Maximization (EM) algorithm. BERTopic outperformed PLSA in topic coherence, achieving a Cv score of 0.41 compared to PLSA's 0.37, and also produced more interpretable topics, as validated by aviation safety experts. These findings underscore the advantages of modern transformer-based approaches for analyzing complex aviation datasets and for supporting informed decision-making in aviation safety. Future work will explore hybrid models, multilingual datasets, and advanced clustering techniques to further improve topic modelling in this domain.
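The EM procedure that the abstract attributes to PLSA can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `plsa` and `normalize`, the six-word vocabulary, and the count matrix are assumptions invented for illustration, and the data is a toy example rather than NTSB text. The sketch alternates an E-step, which computes the posterior P(z | d, w) proportional to P(z | d) P(w | z), with an M-step that re-estimates both distributions from expected counts.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row) or 1.0  # guard against an all-zero row
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM on a document-term count matrix (list of lists).

    Returns P(topic | doc), P(word | topic), and the log-likelihood
    recorded at the start of each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Random positive initialisation of both conditional distributions.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    history = []
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: unnormalised posterior over topics for (d, w).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)  # equals P(w | d) under current params
                log_lik += n_dw * math.log(total)
                for z in range(n_topics):
                    # Expected count of (d, w) assigned to topic z.
                    resp = n_dw * joint[z] / total
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalise the expected counts.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        history.append(log_lik)
    return p_z_d, p_w_z, history


# Toy counts, invented for illustration (rows are documents).
vocab = ["engine", "fuel", "power", "runway", "landing", "gear"]
counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 3, 0, 0, 0],
    [0, 0, 0, 4, 3, 3],
    [0, 0, 0, 3, 4, 2],
]

p_z_d, p_w_z, history = plsa(counts, n_topics=2)

# Most probable topic per document, and top words per topic.
dominant = [max(range(len(row)), key=row.__getitem__) for row in p_z_d]
for z, row in enumerate(p_w_z):
    top_words = [w for _, w in sorted(zip(row, vocab), reverse=True)[:3]]
    print(f"topic {z}: {top_words}")
```

Because EM never decreases the likelihood, the values in `history` should be non-decreasing, which gives a simple sanity check on any implementation. Real corpora would call for a vectorised version and several random restarts, since EM only finds a local optimum.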
