CL · IR · Mar 2

Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis

arXiv:2603.01791v1
h-index: 1
Originality: Synthesis-oriented
AI Analysis

This research addresses the problem of quantifying narrative evolution and interestingness in literature, a question relevant to computational linguistics and digital humanities. The contribution is incremental: it applies an existing theory to new data.

The study analyzed semantic novelty trajectories in more than 80,000 English-language books spanning two centuries. It found that modern books have roughly 10% higher mean paragraph-level novelty (0.503 vs. 0.459) and about 67% higher trajectory circuitousness than pre-1920 literature.

I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). The principal findings are fourfold. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness -- the ratio of cumulative path length to net displacement in embedding space -- is substantially higher in the modern corpus (+67%). Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is orthogonal to reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber's sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at https://bigfivekiller.online/novelty_hub.
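
The three quantities named in the abstract (running-centroid novelty, circuitousness, and the PAA-16 representation) can be illustrated with a minimal sketch. This is not the author's code. The abstract does not give exact definitions, so the following choices are assumptions: novelty is taken as the cosine distance between a paragraph embedding and the mean of all earlier paragraph embeddings, path lengths are Euclidean, and PAA (piecewise aggregate approximation) averages the novelty series over 16 near-equal segments. The function names are hypothetical, and random vectors stand in for sentence-transformer embeddings.

```python
import numpy as np


def running_centroid_novelty(emb):
    """Cosine distance of each paragraph to the centroid of all earlier ones.

    Assumed definition; the first paragraph has no history, so the result
    has one entry fewer than the number of paragraphs.
    """
    emb = np.asarray(emb, dtype=float)
    novelty = []
    centroid_sum = emb[0].copy()
    for t in range(1, len(emb)):
        centroid = centroid_sum / t  # mean of paragraphs 0..t-1
        denom = np.linalg.norm(emb[t]) * np.linalg.norm(centroid)
        cos = float(emb[t] @ centroid) / denom if denom > 0 else 0.0
        novelty.append(1.0 - cos)
        centroid_sum += emb[t]
    return np.array(novelty)


def circuitousness(emb):
    """Cumulative path length divided by net displacement (start to end)."""
    emb = np.asarray(emb, dtype=float)
    path = np.linalg.norm(np.diff(emb, axis=0), axis=1).sum()
    net = np.linalg.norm(emb[-1] - emb[0])
    return path / net if net > 0 else np.inf


def paa(series, segments=16):
    """Piecewise aggregate approximation: mean of each near-equal segment."""
    series = np.asarray(series, dtype=float)
    if len(series) < segments:
        raise ValueError("series must have at least `segments` points")
    return np.array([chunk.mean() for chunk in np.array_split(series, segments)])


# Stand-in data: 200 "paragraphs" with 8-dimensional random embeddings.
rng = np.random.default_rng(0)
book = rng.normal(size=(200, 8))

novelty = running_centroid_novelty(book)
print("mean novelty:", novelty.mean())
print("circuitousness:", circuitousness(book))
print("PAA-16 shape:", paa(novelty, 16).shape)
```

Under these assumptions a book whose embeddings move in a straight line has circuitousness 1, and larger values indicate a more wandering path. The fixed-length PAA vector is what makes books of different lengths comparable for clustering.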
