CL AI CR · Sep 25, 2025

In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

arXiv:2510.01259v1 · Has Code

Originality: Incremental advance

AI Analysis

This work examines vulnerabilities in AI safety guardrails that matter to both users and developers, and highlights risks in model deployment and evaluation. It is rated incremental because it builds on existing auditing methods.

The study probed OpenAI's gpt-oss-20b model to analyze how sociopragmatic factors influence refusal behavior on harmful tasks. Composite prompts raised assistance rates from 0% to 97.5% on a ZIP-bomb construction task, language choice and role-play affected leakage, and a paired evaluation-awareness test showed inconsistent assistance in 13% of matched prompt pairs.

We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .
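The rates quoted in the abstract are proportions over repeated seeded runs, so a short sketch may help show how such figures can be tallied. The code below is illustrative only and is not taken from the released repository: the function names, the boolean "graded as helpful" encoding, and the choice of a Wilson score interval are all assumptions. If the 97.5% figure is computed over the 80 seeded iterations, it would correspond to 78 of 80 runs.

```python
from math import sqrt


def assistance_rate(outcomes):
    """Fraction of runs graded as materially helpful (True)."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(outcomes) / len(outcomes)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def pair_inconsistency(pairs):
    """Share of matched (helpfulness_frame, harmfulness_frame) outcome pairs that differ."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical per-run grades, chosen only to match the reported 0% and 97.5%.
baseline = [False] * 80
composite = [True] * 78 + [False] * 2

print(assistance_rate(baseline), assistance_rate(composite))
print(wilson_interval(sum(composite), len(composite)))
```

With only 80 runs per scenario, an interval around each rate gives a sense of how much of a reported gap (for example the 5 to 10 percentage point differences across inference stacks) could be sampling noise.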
