CL · Sep 24, 2020

Novel Keyword Extraction and Language Detection Approaches

Malgorzata Pikies, Andronicus Riyono, Junade Ali

arXiv:2009.11832v10.2

Originality: Incremental advance

AI Analysis

This work addresses efficiency and accuracy issues in NLP pipelines for language detection, but it is incremental as it builds on existing methods with specific optimizations.

The paper addresses the efficiency of fuzzy string matching for language detection by proposing a fast tokenization method, which yields an 83.6% decrease in processing time and an estimated 3.1% improvement in recall at the cost of a 2.6% decrease in precision. It also finds that the Accept-Language header is 14% more likely than the IP address to match the classified language.
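To make the idea of fuzzy matching over multi-word keywords concrete, here is a minimal sketch. It is not the paper's tokenisation method, which this summary does not describe in detail. It swaps in a plain sliding window of whole words scored with the standard library's `difflib.SequenceMatcher`. The function name `fuzzy_keyword_hits` and the `threshold` default are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_keyword_hits(text, keyword, threshold=0.85):
    """Return (window, score) pairs whose similarity to `keyword` meets `threshold`.

    Illustrative stand-in only: windows advance word by word, so a keyword
    that spans several words is compared against runs of the same word count.
    """
    words = text.lower().split()
    key_words = keyword.lower().split()
    n = len(key_words)
    target = " ".join(key_words)
    hits = []
    # range() is empty when the text has fewer words than the keyword.
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        score = SequenceMatcher(None, window, target).ratio()
        if score >= threshold:
            hits.append((window, score))
    return hits


if __name__ == "__main__":
    # The misspelled two-word phrase should still be reported as a hit.
    print(fuzzy_keyword_hits("my acount suspnded yesterday", "account suspended"))
```

Raising `threshold` trades recall for precision, which is the same kind of trade-off the reported figures describe.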

Fuzzy string matching and language classification are important tools in Natural Language Processing pipelines; this paper provides advances in both areas. We propose a fast, novel approach to string tokenisation for fuzzy language matching and experimentally demonstrate an 83.6% decrease in processing time with an estimated improvement in recall of 3.1% at the cost of a 2.6% decrease in precision. This approach is able to work even where keywords are subdivided into multiple words, without needing to scan character-to-character. So far there has been little work considering using metadata to enhance language classification algorithms. We provide observational data and find the Accept-Language header is 14% more likely to match the classification than the IP Address.
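The metadata comparison in the abstract can be illustrated with a small sketch that extracts the preferred language from an Accept-Language header and measures how often a metadata signal agrees with a classifier's output. This is an assumption-laden illustration, not the authors' pipeline: `preferred_language` and `match_rate` are hypothetical helpers, the parsing follows the standard comma-separated list with optional `q` weights, and the input to `match_rate` is assumed to be pairs of language codes.

```python
def preferred_language(accept_language):
    """Return the primary subtag of the highest-weighted language range, or None."""
    best_tag, best_q = None, 0.0
    for item in accept_language.split(","):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag or tag == "*":
            continue  # skip empty entries and the wildcard range
        q = 1.0  # a range without an explicit q parameter has weight 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        if q > best_q:
            best_tag, best_q = tag, q
    # "fr-CH" -> "fr": keep only the primary language subtag.
    return best_tag.split("-")[0] if best_tag else None


def match_rate(pairs):
    """Fraction of (metadata_language, classified_language) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for meta, cls in pairs if meta == cls) / len(pairs)


if __name__ == "__main__":
    print(preferred_language("fr-CH, fr;q=0.9, en;q=0.8, *;q=0.5"))
```

Computing `match_rate` once with header-derived languages and once with IP-derived languages, against the same classifier output, would give the kind of comparison the abstract reports.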
