CL LG SP Apr 28, 2024

Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression

Li Wan, Tansu Alpcan, Margreta Kuijper, Emanuele Viterbo

arXiv:2405.01584v11.01 citations, h-index: 48, IEEE Trans Knowl Data Eng

Originality: Incremental advance

AI Analysis

This work addresses efficient text classification for resource-constrained applications, though it is incremental as it builds on existing dictionary learning and compression methods.

The paper tackles text classification by proposing a lightweight supervised dictionary learning framework based on data compression. On limited-vocabulary datasets it performs within ~2% of top models while using just 10% of their parameters. It falls short on diverse-vocabulary datasets, likely due to the LZW algorithm's constraints with low-repetition data.

We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm initially employs the Lempel-Ziv-Welch (LZW) algorithm to construct a dictionary from text datasets, focusing on the conceptual significance of dictionary elements. Subsequently, dictionaries are refined considering label data, optimizing dictionary atoms to enhance discriminative power based on mutual information and class distribution. This process generates discriminative numerical representations, facilitating the training of simple classifiers such as SVMs and neural networks. We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify the information-theoretic performance. Tested on six benchmark text datasets, our algorithm competes closely with top models, especially in limited-vocabulary contexts, using significantly fewer parameters. Our algorithm closely matches top-performing models, deviating by only ~2% on limited-vocabulary datasets, using just 10% of their parameters. However, it falls short on diverse-vocabulary datasets, likely due to the LZW algorithm's constraints with low-repetition data. This contrast highlights its efficiency and limitations across different dataset types.
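The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lzw_dictionary` and `mutual_information` are hypothetical, the LZW pass is assumed to run at character level, and scoring an atom by the mutual information between its presence in a document and the class label is only one plausible reading of "based on mutual information". The toy documents are made up for illustration.

```python
import math
from collections import Counter


def lzw_dictionary(text):
    """Standard LZW pass over `text` (character level, an assumption here).

    Returns (table, codes): the phrase -> code dictionary that LZW builds,
    and the sequence of codes it would emit.
    """
    # Seed the table with every distinct character in the input.
    table = {ch: i for i, ch in enumerate(sorted(set(text)))}
    codes = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in table:
            # Keep extending the longest phrase already in the table.
            current = candidate
        else:
            # Emit the known phrase, then register the new, longer one.
            codes.append(table[current])
            table[candidate] = len(table)
            current = ch
    if current:
        codes.append(table[current])
    return table, codes


def mutual_information(present, labels):
    """Mutual information in bits between a feature and the class labels."""
    n = len(labels)
    joint = Counter(zip(present, labels))
    count_x = Counter(present)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


# Hypothetical toy corpus: (document, label) pairs.
docs = [
    ("cheap offer cheap offer", "spam"),
    ("offer cheap cheap", "spam"),
    ("meeting at noon", "ham"),
    ("noon meeting agenda", "ham"),
]

# Phase 1 (sketch): build candidate atoms from the concatenated corpus.
table, _ = lzw_dictionary(" ".join(text for text, _ in docs))
atoms = [phrase for phrase in table if len(phrase) > 1]

# Phase 2 (sketch): rank atoms by how informative their presence is.
labels = [label for _, label in docs]
scores = {
    atom: mutual_information([atom in text for text, _ in docs], labels)
    for atom in atoms
}
top_atoms = sorted(scores, key=scores.get, reverse=True)[:5]
```

Because LZW only registers a longer phrase after it has seen the shorter prefix repeat, text with little repetition yields mostly short atoms, which is consistent with the limitation the abstract reports for diverse-vocabulary datasets.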
