CL · May 30, 2017

A Low Dimensionality Representation for Language Variety Identification

arXiv:1705.10754v1 · 107 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of identifying specific language varieties (e.g., regional dialects) for applications in natural language processing. It is an incremental advance, as it focuses on a limited set of Spanish varieties.

The paper tackles language variety identification for Spanish with a low dimensionality representation (LDR), reporting an accuracy increase of ~35% over common state-of-the-art representations while reducing the representation to only 6 features per variety.

Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ~35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality --and increasing the big data suitability-- to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.
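The abstract states that LDR uses only 6 features per variety but does not say what those features are. The following is a minimal Python sketch of one way such a representation could be built, not the authors' implementation. It assumes, purely for illustration, that each term receives a per-variety weight equal to its share of training occurrences in that variety, and that a document is then summarised by six hypothetical statistics over those weights. The function names `class_term_weights` and `ldr_like_features` and the variety codes are invented here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def class_term_weights(docs, labels):
    """Per-term, per-class weight: fraction of the term's occurrences in that class.

    Assumption for illustration only; raw counts are used instead of any
    weighting scheme the paper may employ.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
            totals[tok] += 1
    return {
        tok: {c: n / totals[tok] for c, n in per_class.items()}
        for tok, per_class in counts.items()
    }


def ldr_like_features(tokens, weights, classes):
    """Six summary statistics per class, giving a vector of length 6 * len(classes)."""
    vector = []
    for c in classes:
        # Weights of the document's tokens that were seen in training.
        known = [weights[t].get(c, 0.0) for t in tokens if t in weights]
        if not known:
            vector.extend([0.0] * 6)
            continue
        vector.extend([
            mean(known),               # average weight
            pstdev(known),             # spread of the weights
            min(known),                # smallest weight
            max(known),                # largest weight
            sum(known) / len(tokens),  # weight mass per token
            len(known) / len(tokens),  # share of tokens seen in training
        ])
    return vector


# Hypothetical variety codes for the five Spanish varieties in the abstract.
classes = ["AR", "CL", "MX", "PE", "ES"]
train_docs = [["che", "vos"], ["wey", "orale"], ["vale", "tio"]]
train_labels = ["AR", "MX", "ES"]

weights = class_term_weights(train_docs, train_labels)
vector = ldr_like_features(["che", "vos", "hola"], weights, classes)
print(len(vector))
```

With five varieties the resulting vector has 5 × 6 = 30 dimensions regardless of vocabulary size, which is the property the abstract highlights when it contrasts LDR with higher-dimensional representations.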

Code Implementations: 1 repo
