CL, May 14, 2019
On the number of k-skip-n-grams
arXiv:1905.05407v1
Originality: Synthesis-oriented
AI Analysis
The result gives text-processing tasks a theoretical foundation, but the contribution is incremental: it builds on existing skip-gram concepts and introduces no new applications or methods.
The paper counts the k-skip-n-grams in a corpus and derives a closed-form expression for that count in terms of the corpus size $L$, the skip parameter $k$, and the n-gram length $n$.
The paper proves that the number of k-skip-n-grams for a corpus of size $L$ is $$\frac{Ln + n + k' - n^2 - nk'}{n} \cdot \binom{n-1+k'}{n-1}$$ where $k' = \min(L - n + 1, k)$.
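The closed form can be sanity-checked numerically. The sketch below is illustrative and not taken from the paper. It assumes a k-skip-n-gram is any choice of $n$ token positions, in order, whose total number of skipped tokens between the first and last chosen position is at most $k$, and that occurrences are counted by position rather than by distinct string. The function names and the guard for $L < n$ are hypothetical additions.

```python
from itertools import combinations
from math import comb


def count_closed_form(L: int, n: int, k: int) -> int:
    """Evaluate the closed-form count for corpus size L, n-gram length n, skip limit k."""
    if L < n:
        # Assumption: a corpus shorter than n tokens contains no n-grams.
        return 0
    k_prime = min(L - n + 1, k)
    numerator = L * n + n + k_prime - n * n - n * k_prime
    # Multiply before dividing so the integer division by n is exact.
    return numerator * comb(n - 1 + k_prime, n - 1) // n


def count_brute_force(L: int, n: int, k: int) -> int:
    """Enumerate position tuples directly (assumed definition: total skips <= k)."""
    return sum(
        1
        for idx in combinations(range(L), n)
        # Tokens skipped = span between first and last position minus the n - 1 steps.
        if idx[-1] - idx[0] - (n - 1) <= k
    )


# Small example: 5 tokens, bigrams, at most one skipped token.
print(count_closed_form(5, 2, 1))  # 7 (4 adjacent pairs + 3 pairs with one skip)

# Compare both counts over a small grid of parameters.
for L in range(1, 9):
    for n in range(1, 5):
        for k in range(0, 5):
            assert count_closed_form(L, n, k) == count_brute_force(L, n, k)
```

Under the assumed definition the two counts agree on this grid; the brute-force enumeration is exponential in $n$ and is meant only as a check on small inputs.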