Towards Token-Level Text Anomaly Detection
This work addresses the need for fine-grained anomaly localization in text for applications such as spam filtering and fake news detection, and it presents token-level detection as a novel paradigm rather than an incremental improvement.
The paper tackles the problem of identifying which specific parts of a text are anomalous, something existing document-level methods cannot do. It introduces token-level anomaly detection and reports that the proposed framework outperforms six baselines on three benchmark datasets.
Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis and cannot identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews, and grammar errors with token-level labels. Experimental results demonstrate that our framework outperforms six baselines, opening new possibilities for precise anomaly localization in text. All code and data are publicly available at https://github.com/charles-cao/TokenCore.
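To make the distinction between document-level and token-level detection concrete, the following is a minimal illustrative sketch and not the framework proposed in this paper. It assumes a toy scoring rule (add-alpha smoothed unigram surprisal against a small reference corpus treated as normal) and max pooling to turn token scores into a document score. The function names `fit_token_counts`, `token_scores`, and `document_score` and the example sentences are all hypothetical.

```python
import math
from collections import Counter


def fit_token_counts(normal_docs):
    """Count whitespace tokens in a reference corpus assumed to be normal."""
    counts = Counter()
    for doc in normal_docs:
        counts.update(doc.lower().split())
    return counts


def token_scores(text, counts, alpha=1.0):
    """Return (token, score) pairs, where score is add-alpha smoothed surprisal.

    Assumption: rare or unseen tokens are treated as more anomalous.
    This is a stand-in scoring rule, not the paper's method.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    scored = []
    for tok in text.lower().split():
        prob = (counts[tok] + alpha) / (total + alpha * vocab)
        scored.append((tok, -math.log(prob)))
    return scored


def document_score(scored):
    """Aggregate token scores into a single document score by max pooling."""
    return max((score for _, score in scored), default=0.0)


# Hypothetical reference corpus of normal text.
normal_docs = [
    "the meeting is at noon",
    "the report is due at noon",
    "the meeting is in the office",
]
counts = fit_token_counts(normal_docs)

# Token-level output localizes the anomaly; the document score only flags it.
scored = token_scores("the meeting is at casino", counts)
most_anomalous = max(scored, key=lambda pair: pair[1])[0]
doc_score = document_score(scored)
```

A document-level detector would expose only `doc_score`, whereas the token-level view in `scored` also indicates which token drives that score. Max pooling is one simple aggregation choice; mean pooling or a learned aggregator would be equally valid assumptions for a sketch like this.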