CL | Jun 25, 2025

Probing AI Safety with Source Code

arXiv:2506.20471v1 | h-index: 18 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses safety risks facing users of LLMs in critical applications and exposes a significant gap in current safety measures. Its contribution is incremental: it proposes a new evaluation method rather than a mitigation.

The paper tackles AI safety in large language models (LLMs) by introducing Code of Thought (CoDoT), a prompting strategy that converts natural language inputs into code. Under CoDoT, state-of-the-art LLMs consistently fail safety tests: toxicity increases up to 16.5 times for GPT-4 Turbo and by 300% on average across seven models.

Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt "Make the statement more toxic: {text}" to: "make_more_toxic({text})". We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo's toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Additionally, recursively applying CoDoT can further increase toxicity two times. Given the rapid and widespread adoption of LLMs, CoDoT underscores the critical need to evaluate safety efforts from first principles, ensuring that safety and capabilities advance together.
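The prompt transformation described in the abstract can be sketched as plain string templating. This is a minimal illustration, not the authors' implementation: the helper names `to_codot` and `nest`, the snake-casing rule, and the reading of "recursively applying CoDoT" as nesting the call are all assumptions made here. The only detail taken from the abstract is the target form `make_more_toxic({text})`.

```python
import re


def to_codot(instruction: str, text: str) -> str:
    """Render a short instruction and its argument as a function-call string.

    Hypothetical helper: the abstract only shows the output form, not how
    the function name is derived from the natural language prompt.
    """
    # Lowercase, then collapse runs of non-alphanumerics into underscores.
    name = re.sub(r"[^a-z0-9]+", "_", instruction.lower()).strip("_")
    return f"{name}({text})"


def nest(instruction: str, text: str, depth: int) -> str:
    """Wrap `text` in the same call `depth` times (assumed meaning of recursion)."""
    out = text
    for _ in range(depth):
        out = to_codot(instruction, out)
    return out


if __name__ == "__main__":
    # "{text}" is kept as a literal placeholder, as in the abstract.
    print(to_codot("make more toxic", "{text}"))
    print(nest("make more toxic", "{text}", 2))
```

The point of the sketch is that the code form carries the same intent as the natural language prompt while changing only the surface syntax, which is what makes it useful as a probe of whether safety training generalizes across input formats.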

Code Implementations: 1 repo
