CL · Sep 23, 2025

Are most sentences unique? An empirical examination of Chomskyan claims

arXiv:2509.19108v1 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This paper addresses a foundational claim in linguistics, but its contribution is incremental: it applies existing methods to new data.

The paper empirically investigates the claim that most linguistic utterances are unique by analyzing corpora of different genres using the NLTK Python library, finding that while unique sentences are often the majority, this varies by genre and duplicates are not insignificant.

A repeated claim in linguistics is that the majority of linguistic utterances are unique. For example, Pinker (1994: 10), summarizing an argument by Noam Chomsky, states that "virtually every sentence that a person utters or understands is a brand-new combination of words, appearing for the first time in the history of the universe." With the increased availability of large corpora, this is a claim that can be empirically investigated. The current paper addresses the question by using the NLTK Python library to parse corpora of different genres, providing counts of exact string matches in each. Results show that while completely unique sentences are often the majority of corpora, this is highly constrained by genre, and that duplicate sentences are not an insignificant part of any individual corpus.
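
The counting step described in the abstract can be illustrated with a minimal sketch. This is not the paper's code: the abstract says only that NLTK was used and that exact string matches were counted. The regex-based `split_sentences` below is a naive stand-in for an NLTK sentence tokenizer, and the function names, the sample text, and the reported statistics are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Hypothetical stand-in for an NLTK sentence tokenizer:
    # split on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def duplicate_stats(sentences):
    # Count exact string matches; no normalization of case or punctuation.
    counts = Counter(sentences)
    total = sum(counts.values())
    occurring_once = sum(1 for c in counts.values() if c == 1)
    return {
        "total": total,                                # all sentence tokens
        "distinct": len(counts),                       # distinct sentence strings
        "occurring_once": occurring_once,              # strings seen exactly once
        "in_duplicate_sets": total - occurring_once,   # tokens whose string repeats
        "share_occurring_once": occurring_once / total if total else 0.0,
    }


# Toy text, invented for this example only.
sample = "Thank you. I saw the dog. Thank you. The dog saw me! Thank you."
stats = duplicate_stats(split_sentences(sample))
print(stats)
```

Whether to normalize case, whitespace, or punctuation before comparing is a design choice that changes the counts; the sketch compares raw strings, which is the strictest reading of "exact string matches".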

