N-gram and Neural Language Models for Discriminating Similar Languages
This work addresses the discrimination of similar languages, a language identification problem in NLP, but the contribution is incremental: it applies existing methods to a shared task without major innovation.
The paper tackles the problem of discriminating similar languages by comparing two systems: a character-based convolutional neural network with a bidirectional LSTM layer (CLSTM) and a character-based n-gram model. The two systems achieved accuracies of 78.45% and 88.45% respectively, with the n-gram model ranking #7 overall.
This paper describes our submission (named clac) to the 2016 Discriminating Similar Languages (DSL) shared task. We participated in the closed Sub-task 1 (Set A) with two separate machine learning techniques. The first approach is a character-based convolutional neural network with a bidirectional long short-term memory (BiLSTM) layer (CLSTM), which achieved an accuracy of 78.45% with minimal tuning. The second approach is a character-based n-gram model, which achieved an accuracy of 88.45%. This is close to the 89.38% achieved by the best submission and ranked us #7 overall.
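To make the second approach concrete, the following is a minimal sketch of one way a character-based n-gram classifier can be built. It is not the clac system itself: the abstract does not state the n-gram orders, smoothing, or scoring function used, so the multinomial naive Bayes scoring, add-one smoothing, n-gram range of 1 to 3, uniform class prior, and toy Spanish/Portuguese sentences are all assumptions made for illustration, and the names `char_ngrams` and `NgramLanguageClassifier` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max from text."""
    # Pad with spaces so that word-initial and word-final n-grams are captured.
    padded = f" {text.lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


class NgramLanguageClassifier:
    """Naive Bayes over character n-gram counts (assumed design, not the paper's)."""

    def __init__(self, n_min=1, n_max=3):
        self.n_min, self.n_max = n_min, n_max
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total number of n-grams
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = list(char_ngrams(text, self.n_min, self.n_max))
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = list(char_ngrams(text, self.n_min, self.n_max))
        vocab_size = len(self.vocab)

        def log_likelihood(label):
            counts, total = self.counts[label], self.totals[label]
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            return sum(
                math.log((counts[g] + 1) / (total + vocab_size)) for g in grams
            )

        # Uniform class prior (an assumption): pick the highest likelihood.
        return max(self.counts, key=log_likelihood)


# Toy training data, invented for this example only.
train_texts = [
    "el niño come una manzana en la calle",
    "la ciudad es muy grande y hermosa",
    "o menino come uma maçã na rua",
    "a cidade é muito grande e bonita",
]
train_labels = ["es", "es", "pt", "pt"]

clf = NgramLanguageClassifier().fit(train_texts, train_labels)
print(clf.predict("una calle muy hermosa"))
print(clf.predict("uma rua muito bonita"))
```

Character n-grams suit this task because closely related languages often share most of their vocabulary but differ in short orthographic patterns, such as the "ll" and "uy" sequences in the Spanish examples above. A real system would train on the full shared-task data and tune the n-gram range and smoothing.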