CL AP | Aug 25, 2023

Assessing Keyness using Permutation Tests

arXiv:2308.13383v1 | 1 citation | h-index: 9

Originality: Incremental advance

AI Analysis

This paper addresses a methodological issue for corpus linguists by providing a more robust statistical framework, though the advance is incremental because it builds on prior suggestions.

The paper tackles the problem of false positives in keyness assessment in corpus linguistics by proposing a resampling-based approach that models corpora as samples of documents rather than tokens, resulting in more accurate p-values for scores like LLR and enabling significance assessment for measures such as logratio.

We propose a resampling-based approach for assessing keyness in corpus linguistics based on suggestions by Gries (2006, 2022). Traditional approaches based on hypothesis tests (e.g. Likelihood Ratio) model the corpora as independent, identically distributed samples of tokens. This model does not account for the often observed uneven distribution of occurrences of a word across a corpus. When occurrences of a word are concentrated in a few documents, large values of LLR and similar scores are in fact much more likely than accounted for by the token-by-token sampling model, leading to false positives. We replace the token-by-token sampling model with a model where corpora are samples of documents rather than tokens, which is much closer to the way corpora are actually assembled. We then use a permutation approach to approximate the distribution of a given keyness score under the null hypothesis of equal frequencies and obtain p-values for assessing significance. We do not need any assumption on how the tokens are organized within or across documents, and the approach works with basically *any* keyness score. Hence, apart from obtaining more accurate p-values for scores like LLR, we can also assess significance for e.g. the logratio, which has been proposed as a measure of effect size. An efficient implementation of the proposed approach is provided in the `R` package `keyperm`, available from GitHub.
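The document-level permutation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the `keyperm` implementation and does not use its API. The function names (`llr`, `permutation_p_value`), the representation of a document as a `(term_count, doc_length)` pair, the two-term form of the log-likelihood ratio, and the add-one p-value estimate are all assumptions made here for illustration.

```python
import math
import random


def llr(a, b, n_a, n_b):
    """Two-term log-likelihood ratio for one term (illustrative form).

    a, b     : occurrences of the term in corpus A and corpus B
    n_a, n_b : total number of tokens in corpus A and corpus B
    """
    total = a + b
    if total == 0:
        return 0.0
    # Expected counts under the null hypothesis of equal relative frequency.
    e_a = n_a * total / (n_a + n_b)
    e_b = n_b * total / (n_a + n_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e_a)
    if b > 0:
        g2 += b * math.log(b / e_b)
    return 2.0 * g2


def permutation_p_value(docs_a, docs_b, score=llr, n_perm=2000, seed=0):
    """Approximate a p-value by shuffling whole documents between corpora.

    docs_a, docs_b : lists of (term_count, doc_length) pairs (assumed layout)
    score          : any keyness score with signature (a, b, n_a, n_b)
    """
    rng = random.Random(seed)
    docs = list(docs_a) + list(docs_b)
    k = len(docs_a)  # corpus A keeps its number of documents in every shuffle

    def stat(group_a, group_b):
        a = sum(c for c, _ in group_a)
        b = sum(c for c, _ in group_b)
        n_a = sum(n for _, n in group_a)
        n_b = sum(n for _, n in group_b)
        return score(a, b, n_a, n_b)

    observed = stat(docs_a, docs_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(docs)
        # Small tolerance guards against floating-point ties.
        if stat(docs[:k], docs[k:]) >= observed - 1e-12:
            hits += 1
    # Add-one estimate, so the reported p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Bursty word: all 50 occurrences sit in a single document of corpus A.
bursty_a = [(50, 1000)] + [(0, 1000)] * 19
bursty_b = [(0, 1000)] * 20
obs_bursty, p_bursty = permutation_p_value(bursty_a, bursty_b)

# Evenly spread word: 5 occurrences in every document of corpus A.
even_a = [(5, 1000)] * 20
even_b = [(0, 1000)] * 20
obs_even, p_even = permutation_p_value(even_a, even_b)
```

In the bursty case the LLR is large, yet every reshuffling of documents yields the same score, so the permutation p-value is 1 and the word is not flagged. In the evenly spread case almost no reshuffling reaches the observed score, so the p-value is small. Because `score` is an ordinary function argument, another keyness measure could be passed in its place.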
