CL, Dec 1, 2015

Multilingual Language Processing From Bytes

Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya

arXiv:1512.00103v2

27.7227 citations

Originality: Incremental advance

AI Analysis

The paper addresses the problem of language-specific preprocessing in NLP by letting a single compact model handle multiple languages, though its contribution is an incremental improvement in multilingual efficiency.

The paper tackles multilingual NLP by introducing Byte-to-Span (BTS), an LSTM model that processes text as bytes to output span annotations, achieving results similar to or better than state-of-the-art in Part-of-Speech tagging and Named Entity Recognition across languages without external data.

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
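To make the [start, length, label] format concrete, here is a minimal Python sketch of how raw text and its annotations could be turned into byte inputs and span targets. It covers only the data encoding, not the LSTM itself. The token spellings ("S0", "L13", "STOP"), the function names, and the assumption that gold spans arrive as character offsets are illustrative choices, not details confirmed by the abstract.

```python
from typing import List, Tuple

# Hypothetical annotation type: (char_start, char_length, label).
Span = Tuple[int, int, str]


def to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


def spans_to_targets(text: str, spans: List[Span]) -> List[str]:
    """Flatten spans into a [start, length, label] target sequence.

    Start positions and lengths are measured in bytes, so character
    offsets are converted by encoding the relevant prefix and span.
    The "S<n>", "L<n>" and "STOP" spellings are assumptions.
    """
    targets: List[str] = []
    for char_start, char_len, label in sorted(spans):
        byte_start = len(text[:char_start].encode("utf-8"))
        span_text = text[char_start:char_start + char_len]
        byte_len = len(span_text.encode("utf-8"))
        targets += [f"S{byte_start}", f"L{byte_len}", label]
    targets.append("STOP")  # assumed end-of-output marker
    return targets


if __name__ == "__main__":
    text = "Óscar Romero was born in El Salvador."
    spans = [(0, 12, "PER"), (25, 11, "LOC")]
    print(len(to_byte_ids(text)))         # 38 bytes for 37 characters
    print(spans_to_targets(text, spans))
    # ['S0', 'L13', 'PER', 'S26', 'L11', 'LOC', 'STOP']
```

Because "Ó" occupies two bytes in UTF-8, the first span is 12 characters but 13 bytes long, and every later start position shifts by one. Counting in bytes keeps the input vocabulary at 256 values regardless of language, which is the property the abstract credits for the compact model size.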
