World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings
This work challenges a common interpretation of LLM internal representations. For researchers studying how LLMs encode world knowledge, it suggests that simple text co-occurrence statistics already capture substantial spatial and temporal structure.
This paper investigates whether spatial and temporal information is already recoverable from static word embeddings, rather than only from large language model (LLM) hidden states. Applying ridge regression probes to GloVe and Word2Vec embeddings, the authors recover substantial geographic signal (held-out R^2 of 0.71-0.87 for city coordinates) and weaker but reliable temporal signal (held-out R^2 of 0.48-0.52 for historical birth years).
Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world-like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co-occurrence-based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held-out R^2 values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years. Semantic-neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary. These findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental structure than is often assumed, and that simple static embeddings can carry this world-shaped structure from text alone. Linear probe recoverability alone therefore does not establish a representational move beyond text.
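
To make the probing setup concrete, the following is a minimal sketch of a ridge regression probe scored by held-out R^2. It is not the authors' pipeline. The function names (`fit_ridge_probe`, `heldout_r2`, `demo`), the regularization strength `alpha=1.0`, the 400/100 split, and the data are all assumptions for illustration. In particular, random vectors stand in for GloVe or Word2Vec embeddings, and the two-column target stands in for city coordinates.

```python
import numpy as np


def fit_ridge_probe(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    X: (n_items, dim) embedding matrix; Y: (n_items, n_targets) targets.
    alpha is an assumed regularization strength, not a value from the paper.
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean  # center so the intercept is not regularized
    Yc = Y - y_mean
    dim = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the probe weights W.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(dim), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def heldout_r2(X_test, Y_test, W, b):
    """Per-target R^2 on held-out items: 1 - SS_res / SS_tot."""
    pred = X_test @ W + b
    ss_res = ((Y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


def demo(seed=0):
    """Run the probe on SYNTHETIC data (hypothetical stand-in for embeddings)."""
    rng = np.random.default_rng(seed)
    n_items, dim = 500, 50
    X = rng.normal(size=(n_items, dim))      # stand-in for word vectors
    true_W = rng.normal(size=(dim, 2))       # hidden linear map to two targets
    Y = X @ true_W + 0.5 * rng.normal(size=(n_items, 2))  # noisy targets
    idx = rng.permutation(n_items)
    train, test = idx[:400], idx[400:]       # assumed 400/100 split
    W, b = fit_ridge_probe(X[train], Y[train], alpha=1.0)
    return heldout_r2(X[test], Y[test], W, b)


if __name__ == "__main__":
    print(demo())
```

Because the synthetic targets are linear in the inputs by construction, the R^2 this sketch prints says nothing about real embeddings. With actual data, `X` would hold the pretrained vectors for city or person names and `Y` the corresponding coordinates or birth years, and the held-out score would measure how much of that variable is linearly recoverable.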