CLIT
Mar 5, 2025

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

arXiv:2503.03601v1
13 citations
h-index: 36
ACL
Originality: Incremental advance
AI Analysis

This work addresses interpretability for researchers and practitioners in ATD, but it is incremental as it builds on existing methods with a focus on feature analysis.

The study tackles the inconsistent performance of Artificial Text Detection (ATD) on unseen text and new LLMs. It uses Sparse Autoencoders to extract features from the Gemma-2-2b residual stream and finds that modern LLMs have a distinct writing style in information-dense domains, even though they can produce human-like outputs.

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
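
To make the feature-extraction step concrete, here is a minimal sketch of how a sparse autoencoder could turn residual-stream activations into one feature vector per text. It is an illustration under stated assumptions, not the paper's implementation: the weights are random rather than trained, the dimensions are toy-sized, the encoder is a plain ReLU SAE, and the names `sae_encode`, `sae_decode`, and `text_features` are hypothetical. Mean-pooling over tokens is also an assumption about how per-token features might be aggregated.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Plain ReLU SAE encoder: (tokens, d_model) -> (tokens, d_sae).

    Assumption: a standard ReLU encoder; real SAEs may use other activations.
    """
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct activations from feature activations."""
    return f @ W_dec + b_dec

def text_features(x, W_enc, b_enc, b_dec):
    """Mean-pool feature activations over tokens: one vector per text."""
    return sae_encode(x, W_enc, b_enc, b_dec).mean(axis=0)

# Toy stand-ins for residual-stream activations and SAE weights (random, untrained).
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 10
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

x = rng.normal(size=(n_tokens, d_model))   # activations for one text
feats = text_features(x, W_enc, b_enc, b_dec)
recon = sae_decode(sae_encode(x, W_enc, b_enc, b_dec), W_dec, b_dec)
```

Per-text vectors like `feats` could then be compared across human-written and model-generated texts, for example by feeding them to a simple classifier or by computing per-domain activation statistics.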
