CL, LG | Feb 11, 2021

A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent

arXiv:2102.06282v1 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work reproduces and validates an existing method for language identification in short text fragments; its contribution is incremental rather than novel.

The authors reproduced Apple's bi-directional LSTM model for language identification in short strings, confirming its performance and showing it outperforms current open-source identifiers, with mistakes primarily due to confusion between related languages.

Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.
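The finding that mistakes come from confusion between related languages is the kind of result one would typically read off a tally of misclassified (true, predicted) label pairs. The following is a minimal sketch of such a tally, not the authors' evaluation code. The function name `confusion_pairs` and the toy labels are hypothetical and chosen only for illustration; they are not data or results from the paper.

```python
from collections import Counter


def confusion_pairs(true_labels, predicted_labels):
    """Count (true, predicted) language pairs for misclassified examples only."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )


# Toy, made-up labels for illustration only; NOT results from the paper.
true_labels = ["da", "da", "nb", "nb", "es", "pt", "en", "en"]
predicted_labels = ["da", "nb", "da", "nb", "pt", "pt", "en", "en"]

pairs = confusion_pairs(true_labels, predicted_labels)

# List the confused pairs, most frequent first.
for (true_lang, predicted_lang), count in pairs.most_common():
    print(f"{true_lang} -> {predicted_lang}: {count}")
```

Sorting the pairs by frequency makes it easy to check whether the most common errors fall between closely related languages, which is the pattern the abstract reports.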
