CLAug 11, 2025

Jinx: Unlimited LLMs for Probing Alignment Failures

arXiv:2508.08243v31 citations
Originality Synthesis-oriented
AI Analysis

This provides researchers with an accessible tool for evaluating alignment failures in language models, though it is incremental as it adapts existing models.

The authors tackled the problem of limited access to helpful-only language models for alignment evaluation by introducing Jinx, a variant of open-weight LLMs that responds to all queries without refusals, enabling researchers to probe alignment failures and study safety boundaries.

Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model's capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes