CL · AI · Apr 18

When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations

arXiv:2604.16787 · 54.2
Predicted impact: top 99% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For NLP practitioners deploying models on user-generated content, this work provides targeted, low-cost fixes for informal text degradation without harming standard performance.

The paper identifies two distinct failure modes in NLI models when processing informal text: tokenization failure (emoji mapped to [UNK]) and distribution shift (noise tokens that are in-vocabulary but unseen in NLI training data). A hybrid mitigation (preprocessing + augmentation) raises ELECTRA-small accuracy from 75.88% to 88.93% on the combined SNLI variant, surpassing GPT-4o-mini zero-shot across all SNLI variants with no statistically significant drop on clean text.
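The difference between the two failure modes can be illustrated with a toy tokenizer. This is a minimal sketch, not ELECTRA's actual WordPiece tokenizer: the vocabulary `TOY_VOCAB` and the helpers `toy_tokenize` and `unk_stats` are hypothetical, chosen only to show that an out-of-vocabulary emoji collapses to [UNK] while noise tokens such as "tbh" pass through tokenization intact.

```python
# Toy illustration only: a whitespace tokenizer with a tiny, made-up vocabulary.
# Real WordPiece also splits words into subword pieces; this sketch skips that step.
TOY_VOCAB = {"a", "the", "dog", "runs", "in", "park", "tbh", "no", "cap", "deadass"}
UNK = "[UNK]"

def toy_tokenize(text):
    """Lowercase, split on whitespace, and map out-of-vocabulary words to [UNK]."""
    return [tok if tok in TOY_VOCAB else UNK for tok in text.lower().split()]

def unk_stats(examples):
    """Return (fraction of examples with at least one [UNK], mean [UNK] count per example)."""
    counts = [toy_tokenize(ex).count(UNK) for ex in examples]
    with_unk = sum(1 for c in counts if c > 0)
    return with_unk / len(counts), sum(counts) / len(counts)

emoji_text = "the 🐶 runs in the 🌳"               # content words replaced by emoji
noise_text = "tbh the dog runs in the park no cap"  # in-vocabulary filler tokens

print(toy_tokenize(emoji_text))   # the emoji positions become [UNK]
print(toy_tokenize(noise_text))   # every token survives tokenization
print(unk_stats([emoji_text, noise_text]))
```

In the emoji case the content is lost before the model sees it, so no amount of learned robustness can recover it. In the noise case the tokens reach the model unchanged, which is why that failure is attributed to distribution shift rather than to the tokenizer.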

We study how informal surface forms degrade NLI accuracy in ELECTRA-small (14M parameters) and RoBERTa-large (355M parameters) across four transforms applied to SNLI and MultiNLI: slang substitution, emoji replacement, Gen-Z filler tokens, and their combination. Slang substitution (replacing formal words with informal equivalents, e.g., "going to" -> "gonna", "friend" -> "homie") causes minimal degradation (at most 1.1pp): slang vocabulary falls largely within WordPiece coverage, so the tokenizer handles it without signal loss. Emoji replacement swaps content words for Unicode characters that ELECTRA's WordPiece tokenizer maps to [UNK], destroying the input signal before any learned parameters see it (93.6% of emoji examples contain at least one [UNK], with a mean of 2.91 per example). Noise tokens (no cap, deadass, tbh) are fully in-vocabulary but absent from NLI training data; the observed degradation is consistent with the model assigning them inferential weight they do not carry. The two failure modes respond to different interventions: preprocessing recovers emoji accuracy by normalizing text before tokenization; augmentation handles noise by exposing the model to noise-bearing examples during training. A hybrid of both achieves 88.93% on the combined variant for ELECTRA on SNLI (up from 75.88%), with no statistically significant drop on clean text. Against GPT-4o-mini zero-shot, unmitigated ELECTRA is significantly worse on transformed variants (p < 0.0001); hybrid ELECTRA surpasses it across all SNLI variants and reaches statistical parity on MultiNLI.
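The two interventions can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the abstract does not specify the emoji-to-text mapping or where noise tokens are inserted, so `normalize_emoji` (which substitutes each symbol's Unicode name), `add_noise`, `augment`, and the `rate` parameter are all hypothetical.

```python
import random
import unicodedata

# Noise tokens named in the abstract; how and where they are inserted is an assumption.
NOISE_TOKENS = ["no cap", "deadass", "tbh"]

def normalize_emoji(text):
    """Preprocessing sketch: replace each 'So' (symbol, other) character with its
    lowercase Unicode name so the tokenizer sees words instead of an unknown symbol.
    Variation selectors and zero-width joiners are left untouched in this sketch."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            out.append(unicodedata.name(ch, "").lower())
        else:
            out.append(ch)
    return "".join(out)

def add_noise(sentence, rng):
    """Prepend or append one noise token; the NLI label is meant to stay unchanged."""
    token = rng.choice(NOISE_TOKENS)
    if rng.random() < 0.5:
        return f"{token} {sentence}"
    return f"{sentence} {token}"

def augment(dataset, rng, rate=0.5):
    """Augmentation sketch: keep every original example and add a noisy copy
    of each example with probability `rate`, preserving its label."""
    augmented = list(dataset)
    for premise, hypothesis, label in dataset:
        if rng.random() < rate:
            augmented.append((add_noise(premise, rng), add_noise(hypothesis, rng), label))
    return augmented

rng = random.Random(0)
train = [("A dog runs in the park.", "An animal is outside.", "entailment")]

print(normalize_emoji("the 🐶 runs in the park"))
for example in augment(train, rng, rate=1.0):
    print(example)
```

In a hybrid setup along these lines, `normalize_emoji` would run on every input at both training and inference time, while `augment` would be applied to the training set only. Keeping the clean originals in the augmented set is one plausible way to avoid a clean-text drop, though the abstract does not say whether the authors did this.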

