CL | IR | ST | Jul 21, 2025

A Fisher's exact test justification of the TF-IDF term-weighting scheme

arXiv:2507.15742v2 | 3 citations | h-index: 33 | Am Stat
Originality: Synthesis-oriented
AI Analysis

The paper provides a statistical foundation for TF-IDF that helps statisticians explain the scheme's effectiveness. The contribution is incremental, since it builds on existing theoretical work.

The paper tackles the theoretical justification of the TF-IDF term-weighting scheme by showing that a common variant, TF-ICF, is closely related to the negative logarithm of the p-value from a one-tailed Fisher's exact test under mild conditions, and converges to TF-IDF in the limit of an infinitely large document collection.

Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term's occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher's exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher's exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme's long-established effectiveness.
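To make the two quantities in the abstract concrete, the following minimal sketch computes them side by side for one term in one document. It rests on assumptions that are not spelled out in the text above: TF-ICF is taken to be the term's count in the document times the log of (total tokens in the collection / the term's total count in the collection), and the one-tailed Fisher's exact test is taken to be the upper tail of a hypergeometric distribution over a 2x2 table of term vs. other tokens, document vs. rest of collection. The function names and the example counts are hypothetical, and the paper's exact definitions and regularity conditions may differ.

```python
import math


def tf_icf(k, K, N):
    """Assumed TF-ICF: k * log(N / K).

    k: occurrences of the term in the document
    K: occurrences of the term in the whole collection
    N: total number of tokens in the whole collection
    """
    return k * math.log(N / K)


def fisher_neg_log_p(k, n, K, N):
    """Negative log of an assumed one-tailed Fisher's exact test p-value.

    The p-value is P(X >= k) for X ~ Hypergeometric(N, K, n), i.e. the chance
    of seeing at least k occurrences of the term among the document's n tokens
    if the document were a random draw of n tokens from the collection.
    """
    upper = min(n, K)
    # Exact integer arithmetic; math.log accepts arbitrarily large ints.
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, upper + 1))
    return math.log(math.comb(N, n)) - math.log(tail)


if __name__ == "__main__":
    # Hypothetical counts: a 200-token document in a 100,000-token collection
    # where the term occurs 50 times overall.
    n, K, N = 200, 50, 100_000
    for k in (1, 2, 5, 10):
        print(k, round(tf_icf(k, K, N), 3), round(fisher_neg_log_p(k, n, K, N), 3))
```

Both scores grow as the term's occurrences become more concentrated in the document. How closely they track each other depends on the conditions the paper states, which this sketch does not attempt to reproduce.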
