Pragmatic Constraint on Distributional Semantics
This work addresses a fundamental limitation in natural language processing that is relevant to both researchers and practitioners, but its contribution is incremental because it builds on known statistical patterns.
The paper investigates how Zipf-law token distributions in language models affect statistical learning. It shows that tokens with a one-to-one correspondence to a semantic concept have statistical properties distinct from those of ambiguous tokens, and that this difference interferes with distributional semantics methods.
This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that a Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that the Zipf distribution is characterized by two distinct groups of tokens that differ in both their frequency and their semantics. Namely, tokens that have a one-to-one correspondence with a single semantic concept have different statistical properties than those with semantic ambiguity. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.
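As a rough illustration of what a Zipf-law token distribution looks like in practice, the sketch below ranks tokens by frequency and fits a straight line to log-frequency against log-rank; a slope near -1 is the classic Zipf signature. This is not the paper's procedure: the function names `rank_frequency` and `zipf_slope`, the ordinary least-squares fit, and the synthetic corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token counts sorted from most to least frequent."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes at least two ranks; a slope close to -1 is consistent
    with the classic form of Zipf's law, frequency ~ 1 / rank.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical synthetic corpus: the token of rank r occurs 2520 / r times,
# so the counts follow frequency ~ 1 / rank exactly.
corpus = []
for r in range(1, 11):
    corpus.extend([f"t{r}"] * (2520 // r))

freqs = rank_frequency(corpus)
slope = zipf_slope(freqs)
```

To probe the claim that the pattern holds irrespective of tokenization, one could run the same two functions on the output of different tokenizers (for example whitespace splitting versus characters or subwords) over the same text and compare the fitted slopes. A simple least-squares fit on log-log data is only a coarse diagnostic and is known to be sensitive to the low-frequency tail.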