CL · Dec 1, 2015

Multilingual Language Processing From Bytes

arXiv:1512.00103v2 · 227 citations
Originality: Incremental advance
AI Analysis

The paper addresses language-specific preprocessing in NLP by letting a single compact model handle multiple languages, though its contribution is an incremental gain in multilingual efficiency.

The paper tackles multilingual NLP by introducing Byte-to-Span (BTS), an LSTM model that processes text as bytes to output span annotations, achieving results similar to or better than state-of-the-art in Part-of-Speech tagging and Named Entity Recognition across languages without external data.

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
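The input and output representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the helper names (`to_byte_ids`, `spans_to_tokens`, `span_text`), the token spellings `S<n>` and `L<n>`, the example sentence, and the `PER`/`LOC` labels are all assumptions chosen for illustration. The sketch only shows how text becomes a sequence of UTF-8 byte values (at most 256 distinct inputs) and how a [start, length, label] span measured in bytes could be flattened into separate output symbols.

```python
from typing import List, Tuple

# A span is (start byte offset, length in bytes, label).
Span = Tuple[int, int, str]


def to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


def spans_to_tokens(spans: List[Span]) -> List[str]:
    """Flatten spans into an output sequence of separate symbols.

    The 'S<n>' / 'L<n>' spellings are hypothetical; the abstract only says
    that starts, lengths, and labels are separate vocabulary entries.
    """
    tokens: List[str] = []
    for start, length, label in spans:
        tokens.extend([f"S{start}", f"L{length}", label])
    return tokens


def span_text(text: str, start: int, length: int) -> str:
    """Recover the surface string covered by a byte-offset span."""
    return text.encode("utf-8")[start:start + length].decode("utf-8")


# Illustrative sentence with non-ASCII characters (two bytes each in UTF-8).
text = "Óscar lives in Köln"
byte_ids = to_byte_ids(text)

# Hypothetical NER annotations, with offsets counted in bytes, not characters.
spans: List[Span] = [(0, 6, "PER"), (16, 5, "LOC")]
output_tokens = spans_to_tokens(spans)

for start, length, label in spans:
    print(label, span_text(text, start, length))
```

Note that the offsets count bytes rather than characters: "Óscar" is five characters but six bytes, which is why the first span has length 6. Working at this level is what keeps the input vocabulary small regardless of the language or script.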
