CL | Dec 19, 2022

Less is More: Parameter-Free Text Classification with Gzip

Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, Jimmy Lin

arXiv:2212.09410v12.111 citations | h-index: 18

Originality: Incremental advance

AI Analysis

The method is a lightweight, parameter-free alternative for text classification, particularly useful in resource-constrained or out-of-distribution scenarios. It is an incremental advance because it builds on existing compression and k-NN techniques.

The authors address the computational cost of deep neural networks for text classification by proposing a non-parametric method that combines gzip compression with a k-nearest-neighbor classifier. The method achieves results competitive with non-pretrained deep learning methods on six in-distribution datasets and outperforms BERT on five out-of-distribution datasets, including low-resource languages.

Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive, requiring billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that is easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training, or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings, where labeled data are too scarce for DNNs to achieve satisfactory accuracy.
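To make the idea in the abstract concrete, here is a minimal Python sketch of a compressor paired with a $k$-nearest-neighbor vote. It rests on assumptions not stated in this summary: the distance between two texts is taken to be the normalized compression distance (NCD), texts are joined with a single space before compression, and ties are broken by training order. The function names (`compressed_len`, `ncd`, `classify`) and the toy training set are hypothetical, chosen only for illustration.

```python
import gzip
from collections import Counter


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance (assumed here as the distance measure).

    Texts that share content compress well together, so the joint
    compressed length is close to that of the larger text alone.
    """
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + " " + b)  # joining with a space is an assumption
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Majority label among the k training texts closest to `query`."""
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy, made-up training data; not taken from the paper's datasets.
    train = [
        ("the team scored a late goal to win the football match", "sports"),
        ("the striker scored twice and the team won the league match", "sports"),
        ("the central bank raised interest rates to curb inflation", "finance"),
        ("stock markets fell as investors worried about interest rates", "finance"),
    ]
    print(classify("the goalkeeper saved a penalty in the match", train, k=3))
```

Nothing is trained here: all the work happens at prediction time, where each query is compressed once alone and once jointly with every training text. This sketch therefore costs one compression per training example per query, which is the price paid for having no parameters.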

