CL · LG · ML · Oct 24, 2018

Universal Language Model Fine-Tuning with Subword Tokenization for Polish

arXiv:1810.10222v1 · 8 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficient transfer learning for highly inflected languages such as Polish. It is an incremental adaptation of an existing method, ULMFiT.

The authors adapt ULMFiT to highly inflected languages such as Polish by using subword tokenization. The approach took first place in Task 3 of PolEval'18, and after further training the final model outperformed the second-best model by 35%.

Universal Language Model for Fine-tuning [arXiv:1801.06146] (ULMFiT) is one of the first NLP methods for efficient inductive transfer learning. Unsupervised pretraining results in improvements on many NLP tasks for English. In this paper, we describe a new method that uses subword tokenization to adapt ULMFiT to languages with high inflection. Our approach results in a new state-of-the-art for the Polish language, taking first place in Task 3 of PolEval'18. After further training, our final model outperformed the second best model by 35%. We have open-sourced our pretrained models and code.
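To make the idea of subword tokenization concrete, here is a minimal sketch of a byte-pair-encoding-style merge learner. It is not the authors' tokenizer, since the abstract does not say which subword algorithm or library they used. The function names (`learn_merges`, `apply_merge`, `segment`) and the small list of inflected forms of "dom" (house) are hypothetical and chosen only for illustration. The point it illustrates: when many inflected forms share a stem, a subword vocabulary can represent them with one stem unit plus short suffix units, instead of one vocabulary entry per surface form.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy BPE-style procedure)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Illustrative data (an assumption, not from the paper): inflected forms of "dom".
FORMS = ["dom", "domu", "domem", "domy", "domami", "domach"]
merges = learn_merges(FORMS, num_merges=2)

# "domowi" is not in FORMS, yet it still decomposes into the shared stem plus characters.
pieces = segment("domowi", merges)
print(pieces)  # ['dom', 'o', 'w', 'i']
```

With only two merges the shared stem "dom" becomes a single unit, so an unseen form such as "domowi" is still covered by known pieces rather than mapped to an unknown-word token.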

Code Implementations: 2 repos
