CL AI IR LG
Aug 27, 2019

Bridging the Gap for Tokenizer-Free Language Models

Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, Noah Constant

arXiv:1908.10322v12.527 citations

Originality: Incremental advance

AI Analysis

This paper addresses the quality gap between tokenizer-free and word-based language models, demonstrating that tokenizer-free models can match word-based ones. The advance is incremental because it builds on existing transformer architectures.

The paper tackles the problem of tokenizer-free language models lagging in quality on large datasets. It shows that, with sufficient capacity, such models can be competitive with word-based ones, and it sets a new state of the art for tokenizer-free LMs on the One Billion Word benchmark.

Purely character-based language models (LMs) have been lagging in quality on large scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the model is essential to achieving competitive results. In this paper, we show that contrary to this conventional wisdom, tokenizer-free LMs with sufficient capacity can achieve competitive performance on a large scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve a new state of the art for tokenizer-free LMs, pushing these models to be on par with their word-based counterparts.
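
To make the term "tokenizer-free" concrete, the following is a minimal sketch of the difference in input representation. It assumes a byte-level encoding (UTF-8 bytes as input IDs), which the abstract above does not specify; the function names and the toy vocabulary are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: contrasts tokenizer-free (byte-level) input
# with word-tokenized input. The byte-level choice is an assumption.

def encode_bytes(text: str) -> list[int]:
    """Map text to UTF-8 byte IDs; the vocabulary is fixed at 256 symbols."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes; lossless for any valid UTF-8 byte sequence."""
    return bytes(ids).decode("utf-8")


def encode_words(text: str, vocab: dict[str, int], unk_id: int = 0) -> list[int]:
    """Whitespace word tokenization; out-of-vocabulary words map to unk_id."""
    return [vocab.get(word, unk_id) for word in text.split()]


sentence = "The cat sat."
toy_vocab = {"<unk>": 0, "The": 1, "cat": 2}  # hypothetical toy vocabulary

byte_ids = encode_bytes(sentence)             # one ID per byte, no unknowns
word_ids = encode_words(sentence, toy_vocab)  # "sat." falls back to <unk>

print(len(byte_ids), word_ids)
```

The byte-level sequence is longer than the word-level one but never needs an unknown-word symbol, which is one reason such models require more capacity and longer contexts to reach comparable quality.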
