Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics
This work addresses a methodological issue in neural network interpretability: common metrics may overestimate superposition because of lexical confounds. The contribution is incremental, but it clarifies foundational assumptions.
The paper asks how to separate superposition from lexical confounds in neural network activations. It finds that lexical identity (same word form) accounts for more activation overlap than semantic similarity in models up to 70B parameters. Removing the confound improves downstream tasks such as word sense disambiguation, with a statistically significant effect (p = 0.002).
If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound, in which neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).
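The 2x2 decomposition described above can be illustrated with a minimal sketch. The abstract does not specify the overlap metric, so this example assumes one: Jaccard overlap between the top-k most active neurons of two contexts. The function names, the cell labels, and the toy activation vectors are all hypothetical and chosen only to show how pairs are sorted into the four cells; they are not the paper's data or method and do not reproduce its results.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

# Labels for the four cells of the 2x2 design: (same word?, same sense?).
CELL_NAMES = {
    (True, True): "both",
    (True, False): "lexical_only",    # same word, different meaning
    (False, True): "semantic_only",   # different word, same meaning
    (False, False): "neither",
}


def top_k_active(activation, k):
    """Return the indices of the k largest activations as a set."""
    order = np.argsort(np.asarray(activation, dtype=float))
    return set(order[-k:].tolist())


def jaccard(a, b):
    """Jaccard overlap between two non-empty sets of neuron indices."""
    return len(a & b) / len(a | b)


def factorial_overlap(samples, k=2):
    """Mean top-k overlap per cell of the 2x2 design.

    samples: list of (word, sense, activation_vector) tuples.
    Cells that receive no pairs are simply absent from the result.
    """
    cells = defaultdict(list)
    for (w1, s1, a1), (w2, s2, a2) in combinations(samples, 2):
        key = (w1 == w2, s1 == s2)
        cells[key].append(jaccard(top_k_active(a1, k), top_k_active(a2, k)))
    return {CELL_NAMES[key]: float(np.mean(vals)) for key, vals in cells.items()}


# Invented toy activations over six neurons (assumption, illustration only).
toy_samples = [
    ("bank", "finance", [1.0, 0.9, 0.0, 0.1, 0.0, 0.0]),
    ("bank", "finance", [0.8, 1.0, 0.0, 0.0, 0.1, 0.0]),
    ("bank", "river", [1.0, 0.0, 0.9, 0.0, 0.1, 0.0]),
    ("lender", "finance", [0.0, 0.9, 0.0, 1.0, 0.1, 0.0]),
    ("shore", "river", [0.0, 0.0, 0.9, 0.0, 0.1, 1.0]),
]

cell_means = factorial_overlap(toy_samples, k=2)
print(cell_means)
```

Comparing the "lexical_only" and "semantic_only" entries against the "neither" baseline is the kind of contrast the decomposition relies on; a real analysis would use model activations over many contexts rather than hand-written vectors.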