Oddballness: universal anomaly detection with language models
This work addresses the problem of detecting anomalies in text and other sequences, with applications such as error detection. It appears incremental, since it builds on existing language model approaches.
The paper tackles unsupervised anomaly detection in sequences by introducing "oddballness", a new metric that measures how strange a token is according to language model probabilities. It reports that oddballness outperforms low-likelihood methods on grammatical error detection tasks when a fully unsupervised setup is assumed.
We present a new method to detect anomalies in texts (and, more generally, in sequences of any data) using language models in a fully unsupervised manner. The method uses the probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it relies on a new metric introduced in this paper: oddballness. Oddballness measures how "strange" a given token is according to the language model. We demonstrate on grammatical error detection tasks (a specific case of text anomaly detection) that oddballness works better than simply flagging low-likelihood events, provided a fully unsupervised setup is assumed.
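The difference between flagging low-likelihood tokens and scoring strangeness can be illustrated with a small sketch. The abstract does not define oddballness, so the formula below is an assumption made for illustration only: the score of an observed token is taken to be the total probability mass by which other tokens exceed it, i.e. the sum of max(0, p_j - p_i) over all j. The function name `oddballness` and both toy distributions are hypothetical.

```python
def oddballness(probs, index):
    """Assumed score (not taken from the abstract): the summed amount by
    which every other outcome's probability exceeds probs[index]."""
    p = probs[index]
    return sum(max(0.0, q - p) for q in probs)


# Two hypothetical next-token distributions. In both, the observed
# token has probability 0.01, so a likelihood threshold cannot
# tell the two cases apart.
peaked = [0.97, 0.01, 0.01, 0.01]  # the model strongly expected another token
flat = [0.01] * 100                # the model had no preference at all

odd_peaked = oddballness(peaked, 1)  # observed token is at index 1
odd_flat = oddballness(flat, 0)      # observed token is at index 0

print(f"peaked: p=0.01, score={odd_peaked:.2f}")
print(f"flat:   p=0.01, score={odd_flat:.2f}")
```

Under this assumed definition, the token in the peaked distribution scores about 0.96 while the token in the flat distribution scores 0, even though both have the same likelihood. This is the kind of distinction a pure low-likelihood criterion would miss.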