CL · Oct 24, 2025

Typoglycemia under the Hood: Investigating Language Models' Understanding of Scrambled Words

arXiv:2510.21326v1 · 2.7 · h-index: 17

Originality: Synthesis-oriented

AI Analysis

This paper addresses a fundamental question about model robustness that matters to NLP researchers, but its contribution is incremental because it builds on known phenomena.

The paper investigates why language models remain robust to scrambled words (typoglycemia) in English. It hypothesizes that few words collapse into identical forms and that those that do occur in easily disambiguated contexts, and it reports that BERT's performance degradation under scrambling is smaller than expected.

Research in linguistics has shown that humans can read words whose internal letters are scrambled, a phenomenon recently dubbed typoglycemia. Some NLP models have recently been proposed that demonstrate similar robustness to such distortions by ignoring the internal order of characters by design. This raises a fundamental question: how can models perform well when many distinct words (e.g., form and from) collapse into identical representations under typoglycemia? Our work, focusing exclusively on English, seeks to shed light on the factors responsible for this robustness. We hypothesize that the main reasons are that (a) relatively few English words collapse under typoglycemia, and (b) collapsed words tend to occur in contexts so distinct that disambiguation becomes trivial. In our analysis, we (i) use the British National Corpus to quantify word collapse and ambiguity under typoglycemia, (ii) evaluate BERT's ability to disambiguate collapsing forms, and (iii) conduct a probing experiment comparing variants of BERT trained from scratch on clean versus typoglycemic Wikipedia text. Our results reveal that the performance degradation caused by scrambling is smaller than expected.
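To make the notion of word collapse concrete, here is a minimal Python sketch. It assumes the common reading of typoglycemia in which the first and last letters stay fixed and only the order of the interior letters is lost; the abstract does not spell out the paper's exact definition, so treat that as an assumption. The helper names `typo_key` and `collapse_groups` and the toy vocabulary are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def typo_key(word: str) -> str:
    """Order-insensitive form of a word (assumed definition):
    keep the first and last letters, sort the interior letters."""
    if len(word) <= 3:
        # Words this short have at most one interior letter,
        # so scrambling cannot change them.
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]


def collapse_groups(vocab):
    """Group words sharing a key; return only groups with 2+ distinct words."""
    groups = defaultdict(set)
    for w in vocab:
        w = w.lower()
        groups[typo_key(w)].add(w)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Toy vocabulary for illustration only (not data from the paper).
vocab = ["form", "from", "salt", "slat", "model", "word", "trial", "trail"]
collisions = collapse_groups(vocab)

for key, words in collisions.items():
    print(key, words)
```

On this toy list, "form" and "from" share one key, as do "salt"/"slat" and "trial"/"trail", while "model" and "word" remain unambiguous. Running the same grouping over a real word list would give a rough count of how many word types collapse, which is the kind of quantity the corpus analysis in step (i) is concerned with.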
