CL, IT | Mar 5, 2025

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov

arXiv:2503.03601v114.714 citations | h-index: 36 | ACL

Originality: Incremental advance

AI Analysis

This work improves interpretability for researchers and practitioners in ATD, but it is incremental: it builds on existing methods and focuses on feature analysis.

The study addresses the inconsistent performance of Artificial Text Detection (ATD) on unseen text and new LLMs. It uses Sparse Autoencoders to extract features from the Gemma-2-2b residual stream and finds that modern LLMs have a distinct writing style in information-dense domains, even though they can produce human-like outputs.

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
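As a rough illustration of the idea in the abstract, the sketch below encodes per-token residual-stream vectors with a plain ReLU sparse autoencoder, averages the feature activations over each text, and ranks features by how much their mean activation differs between human-written and generated texts. It is not the authors' pipeline. The weights, the bias, the toy activations, the ReLU encoder form, and the ranking rule are all assumptions made for the example, and every function name is hypothetical.

```python
# Toy sketch only: weights, activations, and the ranking rule are made up.

def sae_encode(x, w_enc, b_enc):
    """ReLU SAE encoder (assumed form): f_j = max(0, sum_i x_i * W[i][j] + b_j)."""
    return [
        max(0.0, sum(x[i] * w_enc[i][j] for i in range(len(x))) + b_enc[j])
        for j in range(len(b_enc))
    ]

def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def text_features(token_activations, w_enc, b_enc):
    """Encode every token of one text, then average over tokens."""
    return mean_pool([sae_encode(t, w_enc, b_enc) for t in token_activations])

def rank_features(human_texts, generated_texts, w_enc, b_enc):
    """Sort feature indices by |mean(generated) - mean(human)|, largest first."""
    h = mean_pool([text_features(t, w_enc, b_enc) for t in human_texts])
    g = mean_pool([text_features(t, w_enc, b_enc) for t in generated_texts])
    gaps = [abs(gj - hj) for hj, gj in zip(h, g)]
    return sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)

# Hypothetical 3-dimensional "residual stream" and a 3-feature SAE.
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B_ENC = [0.0, -0.5, 0.0]

# Each text is a list of per-token activation vectors (synthetic values).
human_texts = [
    [[1.0, 0.2, 0.5], [0.8, 0.4, 0.3]],
    [[0.9, 0.1, 0.4], [1.1, 0.3, 0.6]],
]
generated_texts = [
    [[1.0, 1.5, 0.5], [0.9, 2.5, 0.4]],
    [[1.0, 2.0, 0.4], [0.9, 1.0, 0.5]],
]

ranking = rank_features(human_texts, generated_texts, W_ENC, B_ENC)
print("most class-discriminative feature index:", ranking[0])
```

In this synthetic setup only the second input dimension differs between the two classes, so the feature reading that dimension ranks first. A real analysis would replace the toy vectors with activations from the model and the toy weights with a trained SAE.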
