CL LG ML | Aug 21, 2020

Neural Machine Translation without Embeddings

arXiv:2008.09396v2 | 27.9730 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for simpler, more universal text processing in NLP. Its contribution is incremental, since it builds on existing byte-level approaches.

The paper tackles the reliance on hand-crafted tokenization in neural machine translation by representing text as UTF-8 bytes and replacing the embedding layer with one-hot byte vectors. In byte-to-byte translation from English to 10 languages, this yields consistent BLEU improvements, with results that rival character-level and even standard subword-level models.

Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular.
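The byte-level, embeddingless input described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the helper names (`to_bytes`, `one_hot`, `elementwise_dropout`) are hypothetical, and the dropout shown is ordinary inverted dropout applied to plain Python lists, assumed here only to show why dropout on one-hot inputs behaves like token dropout.

```python
import random

NUM_BYTE_TYPES = 256  # every possible byte value is one token type


def to_bytes(text):
    """Represent text as a list of UTF-8 byte values (integers in 0..255)."""
    return list(text.encode("utf-8"))


def one_hot(byte_ids):
    """Map each byte id to a 256-dimensional one-hot row (no learned embedding)."""
    rows = []
    for b in byte_ids:
        row = [0.0] * NUM_BYTE_TYPES
        row[b] = 1.0
        rows.append(row)
    return rows


def elementwise_dropout(rows, p, rng):
    """Standard inverted dropout (hypothetical helper, requires 0 <= p < 1).

    Each element is zeroed with probability p; survivors are scaled by 1/(1-p).
    A one-hot row has a single nonzero element, so the row either survives
    (scaled) or becomes all zeros, which removes the whole token.
    """
    scale = 1.0 / (1.0 - p)
    return [[0.0 if rng.random() < p else x * scale for x in row] for row in rows]


byte_ids = to_bytes("né")      # "é" occupies two bytes in UTF-8
inputs = one_hot(byte_ids)     # one 256-dimensional row per byte
noisy = elementwise_dropout(inputs, p=0.5, rng=random.Random(0))
```

In this sketch, every row of `noisy` is either all zeros or a scaled copy of the original one-hot row, which is the sense in which dropout on one-hot decoder inputs amounts to dropping entire tokens.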
