CL · Sep 28, 2016

Byte-based Language Identification with Deep Convolutional Networks

arXiv:1609.09004v2 · 25 citations
AI Analysis

This work addresses the problem of discriminating between similar languages for NLP applications. Its contribution is incremental: it applies an existing method (ResNet) to a new input type (byte representations).

The authors tackled language identification for similar languages using a byte-based deep residual network, achieving accuracies of 84.88% on subtask A, 68.80% on subtask B1, and 69.80% on subtask B2.

We report on our system for the shared task on discriminating between similar languages (DSL 2016). The system uses only byte representations in a deep residual network (ResNet). The system, named ResIdent, is trained only on the data released with the task (closed training). We obtain 84.88% accuracy on subtask A, 68.80% accuracy on subtask B1, and 69.80% accuracy on subtask B2. A large difference in accuracy on development data can be observed with relatively minor changes in our network's architecture and hyperparameters. We therefore expect fine-tuning of these parameters to yield higher accuracies.
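To illustrate what "only byte representations" can mean in practice, here is a minimal Python sketch that turns a sentence into a fixed-length sequence of UTF-8 byte ids, the kind of input a byte-level network could consume. The function name `text_to_byte_ids`, the `max_len` value, and the padding id 256 are assumptions made for illustration; the abstract does not describe ResIdent's actual preprocessing.

```python
def text_to_byte_ids(text, max_len=16, pad_id=256):
    """Encode text as UTF-8 and return exactly max_len integer ids.

    Byte values occupy 0..255; pad_id (hypothetical, set to 256 here)
    marks padding so that it cannot collide with a real byte.
    """
    ids = list(text.encode("utf-8"))[:max_len]    # truncate long inputs
    return ids + [pad_id] * (max_len - len(ids))  # right-pad short inputs


# "č" takes two bytes in UTF-8, so three characters become four byte ids.
print(text_to_byte_ids("čaj", max_len=6))
# → [196, 141, 97, 106, 256, 256]
```

Working on bytes keeps the input vocabulary small and fixed at 256 values plus padding, regardless of the scripts that appear in the data.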
