CL | AI | LG | Mar 2, 2024

VBART: The Turkish LLM

arXiv:2403.01308v2 | 9 citations | h-index: 3 | Has Code
AI Analysis

This work addresses the problem of limited NLP resources for Turkish by providing efficient, high-performing models, though it is incremental as it builds on existing BART and mBART ideas.

The authors tackled the lack of Turkish-specific large language models by developing VBART, a sequence-to-sequence model pre-trained from scratch on Turkish data. Fine-tuned VBART models achieved state-of-the-art results in tasks such as text summarization and question answering, with up to 3x performance improvements over multilingual models and a tokenizer that is up to 11x more efficient.

We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART are compact LLMs based on good ideas leveraged from BART and mBART models and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that having a pre-trained LLM for Turkish outperforms up to 3x multilingual models, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is up to 11x more efficient than multilingual tokenizers. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevancy of Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned vngrs-web-corpus of 135 GB are publicly available at huggingface.co/vngrs-ai.
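The abstract does not say how tokenizer efficiency is measured. One common proxy is the average number of tokens produced per word, where a lower value means shorter sequences for the same text. The sketch below illustrates that ratio only; the two tokenizers are hypothetical toy stand-ins (word-level and character-level), not the VBART or multilingual tokenizers, and the sample sentences are illustrative.

```python
from typing import Callable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average number of tokens emitted per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


# Hypothetical stand-in for a tokenizer well matched to the language:
# one token per word.
def word_level_tokenizer(text: str) -> List[str]:
    return text.split()


# Hypothetical stand-in for a poorly matched tokenizer:
# one token per non-space character.
def char_level_tokenizer(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


# Illustrative Turkish sentences, not taken from the paper's corpus.
sample = ["Merhaba dünya", "Bugün hava çok güzel"]

word_rate = tokens_per_word(word_level_tokenizer, sample)
char_rate = tokens_per_word(char_level_tokenizer, sample)

# How many times more tokens the second tokenizer needs for the same text.
relative_cost = char_rate / word_rate
print(f"word-level: {word_rate:.2f} tokens/word")
print(f"char-level: {char_rate:.2f} tokens/word")
print(f"relative cost: {relative_cost:.2f}x")
```

To compare real tokenizers, the same function could be given any callable that maps a string to a list of tokens, evaluated on a held-out sample of Turkish text.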

