CR | AI | CL | IT | Oct 30, 2025

Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

arXiv:2510.26847v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

The paper provides a practical defense for real-time filtering of obfuscated prompts, though the advance is incremental because it builds on known tokenizer behavior.

The paper tackles the problem of jailbreak attacks on large language models where malicious prompts are disguised using ciphers, and introduces CPT-Filtering, a model-agnostic guardrail technique that identifies encoded text with near-perfect accuracy by analyzing characters per token, validated on over 100,000 prompts.

Large Language Models (LLMs) are susceptible to jailbreak attacks in which malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique relies on a simple yet powerful artifact of language-model tokenization: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of modern methods, which rely on added modules such as dedicated LLMs or perplexity models. We validate our approach on a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.
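The CPT idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vocabulary, the greedy longest-match tokenizer standing in for a trained BPE tokenizer, the function names, and the threshold of 2.0 are all assumptions chosen for illustration. The only part taken from the abstract is the principle itself: divide the number of characters by the number of tokens and flag text whose ratio falls below a threshold.

```python
import codecs
from typing import Callable, List, Sequence

# Hypothetical toy vocabulary standing in for a trained BPE vocabulary.
TOY_VOCAB = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
MAX_PIECE_LEN = max(len(piece) for piece in TOY_VOCAB)


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenizer (a stand-in for BPE, not BPE itself).

    Any character not covered by a vocabulary entry becomes its own token,
    which mimics how unfamiliar text fragments into many short tokens.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        longest = min(MAX_PIECE_LEN, len(text) - i)
        for length in range(longest, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in TOY_VOCAB:
                tokens.append(piece)
                i += length
                break
    return tokens


def chars_per_token(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average number of characters per token; requires non-empty text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("cannot compute CPT for text with no tokens")
    return len(text) / len(tokens)


def looks_obfuscated(text: str, tokenize: Callable[[str], Sequence],
                     threshold: float = 2.0) -> bool:
    """Flag text whose CPT is below the (assumed, illustrative) threshold."""
    if not tokenize(text):
        return False  # nothing to judge on empty input
    return chars_per_token(text, tokenize) < threshold


plain = "the quick brown fox jumps over the lazy dog"
cipher = codecs.encode(plain, "rot13")  # simple character-level cipher

print(chars_per_token(plain, toy_tokenize))   # higher: words map to whole tokens
print(chars_per_token(cipher, toy_tokenize))  # lower: text fragments into characters
print(looks_obfuscated(plain, toy_tokenize), looks_obfuscated(cipher, toy_tokenize))
```

In practice the `tokenize` argument would be the encode function of a real BPE tokenizer, and the threshold would need to be calibrated per tokenizer on representative data; the value used here has no empirical basis.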
