$S^3$ -- Semantic Signal Separation
This work addresses the need for efficient and stable topic modeling in natural language processing, though the contribution appears incremental, since it builds on existing contextual embedding methods.
The paper tackles the problem of slow and volatile contextual topic models by introducing Semantic Signal Separation ($S^3$), a theory-driven approach that applies Independent Component Analysis to contextualized document embeddings. The resulting model is, on average, 4.5x faster than the runner-up, BERTopic.
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Recent efforts have been oriented at incorporating contextual representations in topic modeling and have been shown to outperform classical topic models. These approaches are typically slow, volatile, and require heavy preprocessing for optimal results. We present Semantic Signal Separation ($S^3$), a theory-driven topic modeling approach in neural embedding spaces. $S^3$ conceptualizes topics as independent axes of semantic space and uncovers these by decomposing contextualized document embeddings using Independent Component Analysis. Our approach provides diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextual topic model, being, on average, 4.5x faster than the runner-up BERTopic. We offer an implementation of $S^3$, and all contextual baselines, in the Turftopic Python package.
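The core idea in the abstract, decomposing document embeddings into independent axes with Independent Component Analysis, can be illustrated with a minimal sketch. This is not the Turftopic implementation or its API: the function names `separate_semantic_signals` and `top_terms` are hypothetical, the embeddings are synthetic stand-ins for the output of a sentence encoder, and scoring terms by projecting vocabulary embeddings onto the components is an assumption made for illustration. The sketch uses scikit-learn's `FastICA`.

```python
import numpy as np
from sklearn.decomposition import FastICA


def separate_semantic_signals(doc_embeddings, n_topics, seed=0):
    """Decompose document embeddings into n_topics independent axes (hypothetical helper)."""
    ica = FastICA(n_components=n_topics, random_state=seed, max_iter=1000)
    # Rows are documents; columns are the strength of each independent component.
    doc_topic = ica.fit_transform(doc_embeddings)
    return ica, doc_topic


def top_terms(ica, vocab, vocab_embeddings, k=3):
    """Rank terms by their projection onto each component (assumed scoring rule)."""
    term_topic = ica.transform(vocab_embeddings)   # shape: (n_terms, n_topics)
    order = np.argsort(-term_topic, axis=0)[:k]    # top-k term indices per topic
    return [[vocab[i] for i in order[:, t]] for t in range(term_topic.shape[1])]


# Synthetic data standing in for real encoder output (assumption):
# three non-Gaussian latent signals mixed linearly into a 16-dimensional space.
rng = np.random.default_rng(0)
n_docs, dim, n_topics = 500, 16, 3
sources = rng.laplace(size=(n_docs, n_topics))
mixing = rng.normal(size=(n_topics, dim))
doc_embeddings = sources @ mixing + 0.01 * rng.normal(size=(n_docs, dim))

# Placeholder vocabulary with random embeddings, for illustration only.
vocab = [f"term_{i}" for i in range(20)]
vocab_embeddings = rng.normal(size=(20, dim))

ica, doc_topic = separate_semantic_signals(doc_embeddings, n_topics)
topics = top_terms(ica, vocab, vocab_embeddings)
print(doc_topic.shape)
print(topics)
```

In a real setting, `doc_embeddings` and `vocab_embeddings` would come from the same contextual encoder, so that terms and documents live in one embedding space. Because ICA recovers components only up to sign and ordering, the most negative terms on an axis may be as informative as the most positive ones.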