Hierarchical Character-Word Models for Language Identification
This work addresses the challenges of language identification for social media analysis, but the contribution is incremental because it builds on existing hierarchical approaches.
The authors tackle language identification in short social media texts with unconventional spelling by introducing a hierarchical character-word model, which performs well against strong baselines and can reveal code-switching.
Social media messages' brevity and unconventional spelling pose a challenge to language identification. We introduce a hierarchical model that learns character and contextualized word-level representations for language identification. Our method performs well against strong baselines, and can also reveal code-switching.
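To illustrate how word-level predictions could expose code-switching, here is a minimal sketch. It assumes the model emits one language probability distribution per word; the abstract does not specify the output format, so this is an assumption. The function names, the example message, and the probabilities are all hypothetical and are not results from the paper.

```python
from collections import Counter


def word_labels(word_probs):
    """Pick the most probable language for each word."""
    return [max(probs, key=probs.get) for probs in word_probs]


def message_language(word_probs):
    """Sum the per-word distributions and return the top language overall."""
    totals = Counter()
    for probs in word_probs:
        for lang, p in probs.items():
            totals[lang] += p
    return max(totals, key=totals.get)


def code_switch_points(labels):
    """Return word indices where the predicted language differs from the previous word."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Hypothetical per-word outputs for an English/Spanish message (illustrative values only).
words = ["I", "love", "tacos", "de", "pescado"]
word_probs = [
    {"en": 0.90, "es": 0.10},
    {"en": 0.95, "es": 0.05},
    {"en": 0.60, "es": 0.40},
    {"en": 0.10, "es": 0.90},
    {"en": 0.05, "es": 0.95},
]

labels = word_labels(word_probs)
switches = code_switch_points(labels)
overall = message_language(word_probs)
```

In this toy input the message-level label is "en", while the word-level labels change language at index 3 ("de"). A single message-level prediction would hide that kind of switch.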