CL · Jun 28, 2024

Automatic register identification for the open web using multilingual deep learning

arXiv:2406.19892v4 · 5 citations
Originality: Incremental advance
AI Analysis

This work addresses the automatic classification of text varieties (registers) on the open web for multilingual applications. It is rated incremental because it builds on existing classification methods, applying them with a more complex register scheme.

The authors identify web registers across 16 languages using multilingual deep learning. Their best model achieves 79% F1 averaged across languages, which matches or exceeds previous studies, and performance rises to over 90% F1 after data pruning.

This article presents multilingual deep learning models for identifying web registers (text varieties such as news reports and discussion forums) across 16 languages. We introduce the Multilingual CORE corpora, which contain over 72,000 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Using multi-label classification, our best model achieves 79% F1 averaged across languages, matching or exceeding previous studies that used simpler classification schemes. This demonstrates that models can perform well even with a complex register scheme at multilingual scale. However, we observe a consistent performance ceiling across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers rather than model limitations. Analysis of hybrid texts (those combining multiple registers) reveals that the main challenge lies not in classifying hybrids themselves, but in distinguishing hybrid from non-hybrid documents. Multilingual models consistently outperform monolingual ones, particularly for languages with limited training data. Zero-shot performance on unseen languages drops by an average of 7%, though this varies by language (3% to 8%), indicating that while registers share features across languages, they also retain language-specific characteristics.
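
To make the headline metric concrete, here is a minimal sketch of how an "F1 averaged across languages" figure can be computed for multi-label predictions. It rests on assumptions: the abstract does not say which F1 variant is used, so this sketch uses micro-averaged F1 within each language and an unweighted mean across languages. The function names, the toy documents, and the label strings are hypothetical illustrations, not the authors' code or the actual taxonomy labels.

```python
from collections import defaultdict


def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label documents.

    gold and pred are parallel lists of label sets, one set per document.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_averaged_across_languages(docs):
    """docs: iterable of (language, gold_labels, predicted_labels).

    Returns (unweighted mean of per-language F1, per-language F1 dict).
    """
    by_lang = defaultdict(lambda: ([], []))
    for lang, gold, pred in docs:
        by_lang[lang][0].append(set(gold))
        by_lang[lang][1].append(set(pred))
    per_lang = {lang: micro_f1(g, p) for lang, (g, p) in by_lang.items()}
    return sum(per_lang.values()) / len(per_lang), per_lang


# Toy data with hypothetical register labels; the second document is a
# hybrid (two gold labels) for which only one label was predicted.
docs = [
    ("en", {"news"}, {"news"}),
    ("en", {"narrative", "opinion"}, {"narrative"}),
    ("fi", {"discussion"}, {"discussion"}),
    ("fi", {"news"}, {"opinion"}),
]
avg_f1, per_lang_f1 = f1_averaged_across_languages(docs)
print(per_lang_f1, avg_f1)
```

Averaging per-language scores without weighting gives each language equal influence regardless of corpus size, which is one plausible reading of "averaged across languages"; weighting by document count would be an alternative.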

Code Implementations: 1 repo
