CL · Jun 19, 2025

PL-Guard: Benchmarking Language Model Safety for Polish

arXiv:2506.16322v1 · 2 citations · h-index: 4 · Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
Originality: Synthesis-oriented
AI Analysis

This work addresses the English-centric bias of AI safety tooling as it affects Polish speakers. Its contribution is incremental, since it adapts existing methods to a new language.

The authors tackle the lack of safety assessments for non-English languages by creating a manually annotated benchmark dataset for language model safety classification in Polish, including adversarial variants. They find that a HerBERT-based classifier achieves the highest overall performance, especially under adversarial conditions.

Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
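
The abstract does not say which perturbations are used to build the adversarial variants. As a rough illustration of the general idea only, the sketch below assumes one simple perturbation (replacing Polish diacritics with ASCII look-alikes) and measures how a classifier's accuracy changes between clean and perturbed samples. The names `strip_diacritics`, `accuracy`, and `robustness_gap`, and the toy keyword classifier in the usage note, are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple

# Assumed perturbation (not from the paper): map Polish diacritics
# to their closest ASCII letters.
POLISH_TO_ASCII = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

Sample = Tuple[str, str]  # (text, gold label)


def strip_diacritics(text: str) -> str:
    """Return the text with Polish diacritics replaced by ASCII letters."""
    return text.translate(POLISH_TO_ASCII)


def accuracy(classify: Callable[[str], str], samples: List[Sample]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    correct = sum(classify(text) == label for text, label in samples)
    return correct / len(samples)


def robustness_gap(
    classify: Callable[[str], str],
    samples: List[Sample],
    perturb: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (accuracy on clean samples, accuracy on perturbed samples)."""
    perturbed = [(perturb(text), label) for text, label in samples]
    return accuracy(classify, samples), accuracy(classify, perturbed)
```

For example, a toy classifier that flags any text containing the word "groźba" scores 1.0 on the two clean samples ("To jest groźba", unsafe) and ("Miłego dnia", safe), but only 0.5 once diacritics are stripped, because "grozba" no longer matches the keyword. A real evaluation would substitute the fine-tuned guard model for the toy classifier and the benchmark's own perturbed samples for this assumed transformation.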

