IR · CL · LG · Apr 3, 2020

Testing pre-trained Transformer models for Lithuanian news clustering

arXiv:2004.03461v1 · 1 citation
AI Analysis

This work addresses news clustering in Lithuanian, a low-resource language, and offers an incremental improvement for that domain.

The study compared pre-trained multilingual Transformer models (BERT, XLM-R) with older methods for Lithuanian news clustering, finding that fine-tuned Transformers outperform word vectors but underperform doc2vec embeddings.

The recent introduction of the Transformer deep learning architecture led to breakthroughs in various natural language processing tasks. However, non-English languages could not benefit from these advances, because the available models were pre-trained on English text. This changed with research on multilingual models, of which less-spoken languages are the main beneficiaries. We compare pre-trained multilingual BERT, XLM-R, and older learned text representation methods as encodings for the task of Lithuanian news clustering. Our results indicate that publicly available pre-trained multilingual Transformer models can be fine-tuned to surpass word vectors but still score much lower than specially trained doc2vec embeddings.

