CL · Nov 26, 2025

Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

arXiv:2511.21334v12.7

Originality: Synthesis-oriented

AI Analysis

This work helps researchers understand emergent linguistic structure in language models, but it is incremental: it applies an existing law to a new kind of data, model-generated text.

The study investigates Martin's Law in neural language models and finds that the relationship between word frequency and polysemy emerges non-monotonically during training. Correlation peaks (r > 0.6) at checkpoint 104, and at late checkpoints smaller models show catastrophic semantic collapse while larger ones degrade gracefully.

We present the first systematic investigation of Martin's Law (the empirical relationship between word frequency and polysemy) in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation ($r > 0.6$) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable ($r \approx -0.3$) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text does not increase monotonically with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
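The abstract's pipeline (cluster a word's contextualized embeddings with DBSCAN, count the clusters as senses, then correlate sense counts with word frequency) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption rather than the authors' setup: the cosine distance, the `eps` and `min_samples` values, ignoring noise points when counting senses, and the use of Pearson correlation against log frequency. The helper names `count_senses` and `martin_correlation` are hypothetical.

```python
import math


def cosine_distance(u, v):
    # Assumes non-zero vectors; 0.0 means same direction, 2.0 means opposite.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(points, eps, min_samples):
    """Plain DBSCAN. Returns one label per point: 0, 1, ... or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Includes point i itself, so min_samples counts the point as well.
        return [j for j, q in enumerate(points)
                if cosine_distance(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:  # j is a core point: expand
                queue.extend(j_neighbors)
    return labels


def count_senses(embeddings, eps=0.05, min_samples=2):
    # Assumption: each non-noise cluster is one sense; noise is ignored.
    labels = dbscan(embeddings, eps, min_samples)
    return len({label for label in labels if label != -1})


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def martin_correlation(word_stats):
    """word_stats: list of (frequency, sense_count) pairs, one per word."""
    log_freq = [math.log(freq) for freq, _ in word_stats]
    senses = [count for _, count in word_stats]
    return pearson(log_freq, senses)


# Toy 2-D "embeddings" for one word, pointing in two distinct directions.
toy_word = [(1.0, 0.0), (0.99, 0.05), (0.98, 0.1),
            (0.0, 1.0), (0.05, 0.99), (0.1, 0.98)]
toy_sense_count = count_senses(toy_word)  # two direction groups, two clusters
```

In this framing, a positive `martin_correlation` over a vocabulary sample at a given checkpoint would indicate that more frequent words have more induced senses. Real contextualized embeddings are high-dimensional, so `eps` would need tuning, and a library implementation such as scikit-learn's `DBSCAN` would be the practical choice.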
