How is BERT surprised? Layerwise detection of linguistic anomalies
This work offers insight into the internal mechanisms by which language models detect anomalies. Its contribution is incremental, since it builds on existing methods for analyzing model behavior.
The study investigated how transformer language models detect linguistic anomalies by measuring surprisal at each layer. In lower layers, surprisal correlates with low token frequency. Morphosyntactic anomalies elicit surprisal at earlier layers than semantic anomalies, while commonsense anomalies elicit no surprisal at any intermediate layer. RoBERTa performed best of the three models.
Transformer language models have shown a remarkable ability to detect when a word is anomalous in context, but likelihood scores offer no information about the cause of the anomaly. In this work, we use Gaussian models for density estimation at intermediate layers of three language models (BERT, RoBERTa, and XLNet), and evaluate our method on BLiMP, a grammaticality judgement benchmark. In lower layers, surprisal is highly correlated with low token frequency, but this correlation diminishes in upper layers. Next, we gather datasets of morphosyntactic, semantic, and commonsense anomalies from psycholinguistic studies. We find that the best-performing model, RoBERTa, exhibits surprisal in earlier layers when the anomaly is morphosyntactic than when it is semantic, while commonsense anomalies do not exhibit surprisal at any intermediate layer. These results suggest that language models employ separate mechanisms to detect different types of linguistic anomalies.
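The abstract does not give implementation details, so the following is only a minimal sketch of layerwise Gaussian density estimation under stated assumptions. It assumes that contextual token embeddings have already been extracted for each layer (extraction is out of scope here). It also assumes one full-covariance Gaussian per layer with a small ridge term added for invertibility, and that a sentence's score is the sum of its token surprisals. All function names (`fit_gaussian`, `surprisal`, `sentence_surprisal`, `pair_accuracy`, `layerwise_accuracy`) are hypothetical and do not come from the paper. In a BLiMP-style minimal pair, a layer "detects" the anomaly when the anomalous sentence receives higher surprisal than its well-formed counterpart.

```python
import numpy as np


def fit_gaussian(embeddings: np.ndarray, reg: float = 1e-3):
    """Fit a full-covariance Gaussian to an (n_tokens, dim) array of
    contextual embeddings taken from a single layer."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    cov = centered.T @ centered / (len(embeddings) - 1)
    # Assumption: a small ridge term keeps the covariance invertible
    # when the embedding dimension is large relative to the token count.
    cov += reg * np.eye(cov.shape[0])
    precision = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return mean, precision, logdet


def surprisal(x: np.ndarray, mean, precision, logdet) -> np.ndarray:
    """Negative log-density of each row of x under the fitted Gaussian."""
    d = x.shape[-1]
    diff = x - mean
    # Squared Mahalanobis distance of each row from the mean.
    mahal = np.einsum("ij,jk,ik->i", diff, precision, diff)
    return 0.5 * (mahal + logdet + d * np.log(2 * np.pi))


def sentence_surprisal(token_embeddings: np.ndarray, params) -> float:
    # Assumption: sentence score = sum of its token surprisals.
    return float(surprisal(token_embeddings, *params).sum())


def pair_accuracy(good_scores, bad_scores) -> float:
    """Fraction of minimal pairs in which the anomalous sentence is more surprising."""
    good = np.asarray(good_scores)
    bad = np.asarray(bad_scores)
    return float(np.mean(bad > good))


def layerwise_accuracy(train_by_layer, pairs_by_layer):
    """train_by_layer: {layer: (n, d) array of embeddings from typical text}
    pairs_by_layer: {layer: [(good_emb, bad_emb), ...]}, each emb is (tokens, d)."""
    results = {}
    for layer, train in train_by_layer.items():
        params = fit_gaussian(train)
        pairs = pairs_by_layer[layer]
        good = [sentence_surprisal(g, params) for g, _ in pairs]
        bad = [sentence_surprisal(b, params) for _, b in pairs]
        results[layer] = pair_accuracy(good, bad)
    return results
```

Comparing `layerwise_accuracy` across layers is one way to ask *where* in the network an anomaly first becomes detectable. Because the accuracy is computed separately per layer, different anomaly types can be compared by which layer they first exceed chance at, which mirrors the layerwise comparison the abstract describes. The synthetic arrays one would use to exercise this code are toy inputs, not data from the paper.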