CLAug 16, 2016

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn

arXiv:1608.04465v111.028 citationsHas Code

Originality Incremental advance

AI Analysis

This work addresses efficiency challenges in language modeling for large datasets, offering a more memory-efficient and faster solution, though it is incremental as it builds on existing suffix tree techniques.

The paper tackles the problem of scaling high-order n-gram language models to large corpora by proposing a method based on compressed suffix trees, which achieves up to 2500x faster query runtimes with only modest increases in construction time and memory usage, while being competitive with state-of-the-art packages like KenLM in terms of performance.

Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).

View on arXiv PDF Code

Similar