CLAISep 26, 2025

OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

arXiv:2510.01266v11 citationsh-index: 2
Originality Incremental advance
AI Analysis

This work addresses safety risks for underrepresented communities using AI models in low-resource languages, highlighting a gap in current red-teaming efforts.

The paper investigated safety alignment issues in OpenAI's GPT-OSS-20B model for the low-resource language Hausa, revealing vulnerabilities such as generating harmful content and factual inaccuracies, with a survey showing 98% of participants identified toxic substances as safe in model outputs.

In response to the recent safety probing for OpenAI's GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With a minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that common insecticide locally known as Fiya-Fiya (Cyphermethrin) and rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming effort and offer some recommendations.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes