Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees
This work addresses efficiency challenges in language modeling for large datasets, offering a more memory-efficient and faster solution, though it is incremental as it builds on existing suffix tree techniques.
The paper tackles the problem of scaling high-order n-gram language models to large corpora by proposing a method based on compressed suffix trees, which achieves up to 2500x faster query runtimes with only modest increases in construction time and memory usage, while being competitive with state-of-the-art packages like KenLM in terms of performance.
Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).