CL, LG | Feb 11, 2021

A reproduction of Apple's bi-directional LSTM models for language identification in short strings

arXiv:2102.06282v1800 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

This work reproduces and validates an existing method for language identification in short text fragments; its contribution is incremental rather than novel.

The authors reproduce Apple's bi-directional LSTM model for language identification in short strings, confirm its performance, and show that it outperforms current open-source identifiers. Its mistakes stem from confusion between related languages.

Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.

Code Implementations: 1 repo
