Federated Non-negative Matrix Factorization for Short Texts Topic Modeling with Mutual Information
This work addresses privacy-preserving topic modeling for short texts in federated settings, offering a novel solution to performance degradation from data heterogeneity, though it is incremental as it builds on existing federated learning and NMF methods.
The paper tackles the problem of training high-quality topic models on private, distributed short-text data. It proposes a federated non-negative matrix factorization framework with mutual information (FedNMF+MI) to handle data heterogeneity, and reports significant improvements over baselines in coherence and classification F1 scores.
Non-negative matrix factorization (NMF)-based topic modeling is widely used in natural language processing (NLP) to uncover the hidden topics of short text documents. Training a high-quality topic model usually requires a large amount of textual data. In many real-world scenarios, however, customer textual data is private and sensitive, which precludes uploading it to data centers. This paper proposes a Federated NMF (FedNMF) framework, which allows multiple clients to collaboratively train a high-quality NMF-based topic model on locally stored data. However, standard federated learning significantly undermines the performance of topic models in downstream tasks (e.g., text classification) when the data distribution across clients is heterogeneous. To alleviate this issue, we further propose FedNMF+MI, which simultaneously maximizes the mutual information (MI) between the count features of local texts and their topic weight vectors to mitigate the performance degradation. Experimental results show that our FedNMF+MI methods outperform Federated Latent Dirichlet Allocation (FedLDA) and FedNMF without MI on short texts by a significant margin in both coherence score and classification F1 score.
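To make the federated setup concrete, the following is a minimal sketch of one way a FedNMF-style training loop could look. It is not the paper's algorithm: the abstract does not specify the update rule or the aggregation scheme, so the sketch assumes standard Lee-Seung multiplicative updates on each client and FedAvg-style size-weighted averaging of a shared topic-word matrix on the server. The MI regularizer of FedNMF+MI is omitted entirely, and the names `local_update` and `fed_nmf` are hypothetical.

```python
import numpy as np


def local_update(V, W, H, n_steps=5, eps=1e-10):
    """Run a few multiplicative NMF updates on one client (assumed rule).

    V: (docs, vocab) non-negative count matrix that stays on the client.
    W: (docs, topics) local document-topic weights, never shared.
    H: (topics, vocab) topic-word matrix received from the server.
    """
    H = H.copy()  # do not modify the server's copy in place
    for _ in range(n_steps):
        W = W * (V @ H.T) / (W @ H @ H.T + eps)
        H = H * (W.T @ V) / (W.T @ W @ H + eps)
    return W, H


def fed_nmf(client_data, n_topics, n_rounds=20, seed=0):
    """Hypothetical FedAvg-style loop: only H leaves each client."""
    rng = np.random.default_rng(seed)
    vocab = client_data[0].shape[1]
    H = rng.random((n_topics, vocab)) + 0.1
    Ws = [rng.random((V.shape[0], n_topics)) + 0.1 for V in client_data]
    # Weight each client by its number of local documents.
    weights = np.array([V.shape[0] for V in client_data], dtype=float)
    weights /= weights.sum()
    for _ in range(n_rounds):
        local_Hs = []
        for k, V in enumerate(client_data):
            Ws[k], H_k = local_update(V, Ws[k], H)
            local_Hs.append(H_k)
        # Server step: size-weighted average of the clients' topic-word matrices.
        H = sum(w * H_k for w, H_k in zip(weights, local_Hs))
    return Ws, H


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two toy clients with different document counts and a shared vocabulary.
    clients = [rng.poisson(1.0, size=(30, 50)).astype(float),
               rng.poisson(2.0, size=(20, 50)).astype(float)]
    Ws, H = fed_nmf(clients, n_topics=4)
    print(H.shape, [W.shape for W in Ws])
```

In this sketch the raw count matrices and the per-document topic weights stay on the clients, and only the topic-word matrix is exchanged. With heterogeneous clients, plain averaging of this kind is where the abstract reports degraded downstream performance, which is the case the MI term is meant to address.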