CL, LG · Dec 22, 2014

Language Recognition using Random Indexing

Aditya Joshi, Johan Halseth, Pentti Kanerva

arXiv:1412.7026v2 · 3 citations

Originality: Synthesis-oriented

AI Analysis

The paper provides a computationally efficient method for language identification, though the contribution is incremental: it applies an existing technique, Random Indexing, to a specific task.

The paper tackled language recognition by using Random Indexing with letter blocks to generate language representation vectors, achieving 97.8% accuracy on a dataset of 21,000 sentences from 21 languages, comparable to state-of-the-art methods.

Random Indexing is a simple implementation of Random Projections with a wide range of applications. It can solve a variety of problems with good accuracy without introducing much complexity. Here we use it to identify the language of text samples. We present a novel method of generating language representation vectors using letter blocks. Further, we show that the method is easily implemented and requires little computational power and space. Experiments on a number of model parameters illustrate certain properties of high-dimensional sparse vector representations of data. The statistical relevance of the language vectors is demonstrated by the extremely high success rate on various language recognition tasks. On a difficult data set of 21,000 short sentences from 21 different languages, our model performs a language recognition task and achieves 97.8% accuracy, comparable to state-of-the-art methods.
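The abstract describes the approach only at a high level, so the following is a minimal sketch of one common way to build language vectors from letter blocks, not the authors' implementation. It assumes dense random ±1 letter vectors, a cyclic rotation to encode letter position within a block, elementwise multiplication to combine the letters of a block, and cosine similarity for comparison. The dimensionality `DIM`, the block size `n=3`, and all function names are illustrative assumptions.

```python
import numpy as np

DIM = 10_000  # assumed dimensionality; not taken from the abstract
_rng = np.random.default_rng(0)
_letter_vectors = {}  # cache: one fixed random vector per character


def letter_vector(ch):
    """Return the random +1/-1 vector assigned to a character (created on first use)."""
    if ch not in _letter_vectors:
        _letter_vectors[ch] = _rng.choice(np.array([-1, 1]), size=DIM)
    return _letter_vectors[ch]


def block_vector(block):
    """Encode a letter block: rotate each letter by its distance from the end, then multiply."""
    n = len(block)
    vec = np.ones(DIM, dtype=np.int64)
    for i, ch in enumerate(block):
        # Earlier letters are rotated more, so "ab" and "ba" get different codes.
        vec *= np.roll(letter_vector(ch), n - 1 - i)
    return vec


def text_vector(text, n=3):
    """Sum the vectors of all overlapping n-letter blocks in the text."""
    total = np.zeros(DIM, dtype=np.int64)
    for i in range(len(text) - n + 1):
        total += block_vector(text[i:i + n])
    return total


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def classify(text, language_vectors, n=3):
    """Return the label whose language vector is most similar to the text's vector."""
    query = text_vector(text, n)
    return max(language_vectors, key=lambda lang: cosine(query, language_vectors[lang]))
```

In this sketch, a language vector is simply `text_vector` applied to a training corpus for that language, and `classify` takes a dictionary mapping language labels to such vectors. Training and classification each need a single pass over the text and one fixed-size vector per language, which is consistent with the abstract's claim of low computational and storage cost.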
