CL, AI | Nov 21, 2025

The PLLuM Instruction Corpus

arXiv:2511.17161v1 | 2 citations
Originality: Synthesis-oriented
AI Analysis

This work offers a resource and practical insights for building similar datasets for other language models, but its contribution is incremental because it focuses on a single domain (Polish LLMs).

The paper introduces the PLLuM instruction corpus, a dataset used to fine-tune Polish large language models, and analyzes the implications of human-authored versus synthetic instructions for linguistic adaptation.

This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.

