Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis
The method provides a scalable, non-AI alternative for language detection, though the contribution is incremental because it builds on classical frequency-based methods.
The paper tackles language identification by comparing monogram and bigram frequency rankings under the Minkowski norm, achieving over 80% accuracy on texts shorter than 150 characters and 100% on longer ones.
The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. Non-AI-based approaches to language identification, however, have been overshadowed. This research explores a mathematical implementation of an algorithm for language identification that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy on longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
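To make the idea concrete, here is a minimal Python sketch of rank-based detection with a Minkowski distance, (sum |r_i - s_i|^p)^(1/p), applied to monogram and bigram rank profiles. It is an illustration under stated assumptions, not the paper's implementation: the function names, the order p = 2, the profile size top_k, the penalty rank given to missing n-grams, and the summing of the monogram and bigram distances are all hypothetical choices. The reference profiles are built from two toy sentences rather than from the established frequency tables the abstract mentions.

```python
from collections import Counter


def ngrams(text, n):
    """Character n-grams taken inside words; digits and punctuation are dropped."""
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return [w[i:i + n] for w in cleaned.split() for i in range(len(w) - n + 1)]


def rank_profile(text, n, top_k):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(ngrams(text, n))
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ordered)}


def minkowski_distance(a, b, p, top_k):
    """Minkowski distance of order p between two rank profiles.

    Assumption: an n-gram missing from a profile is assigned rank top_k.
    """
    grams = set(a) | set(b)
    total = sum(abs(a.get(g, top_k) - b.get(g, top_k)) ** p for g in grams)
    return total ** (1.0 / p)


def detect(text, references, p=2, top_k=30):
    """Return the language whose reference profiles are closest to the text."""
    def score(ref_text):
        # Assumption: monogram and bigram distances are simply added.
        return sum(
            minkowski_distance(rank_profile(text, n, top_k),
                               rank_profile(ref_text, n, top_k), p, top_k)
            for n in (1, 2)
        )
    return min(references, key=lambda lang: score(references[lang]))


# Toy reference texts, for illustration only (not real frequency tables).
REFERENCES = {
    "en": "the quick brown fox jumps over the lazy dog and then the other",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und dann",
}

print(detect("the dog and the fox", REFERENCES))
```

In a realistic setting the reference profiles would come from large corpora or published frequency tables, and p and top_k would be tuned on held-out texts; very short inputs yield sparse profiles, which is consistent with the lower accuracy reported for texts under 150 characters.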