CL, LG · Nov 20, 2025

Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

arXiv:2511.16540v1 · 1 citation · h-index: 6
Originality: Incremental advance
AI Analysis

The paper provides a proof of concept for LLM interpretability. The advance is incremental because it applies shallow learning models to existing data.

The paper tackles the problem of interpreting Large Language Models (LLMs) by predicting text genre from model activations, reaching F1-scores of up to 98% and 71% on its two datasets with scikit-learn classifiers trained on Mistral-7B activations.

Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and by the impossibility of having all their outputs evaluated by humans. In this paper, we present the first step towards a predictive framework in which the genre of a text used to prompt an LLM is predicted from the model's activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71%, respectively, using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLM activations with shallow learning models.
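
The probing setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the activations here are synthetic Gaussian clusters standing in for pooled Mistral-7B hidden states, the classifier choice (logistic regression) is an assumption, and the control task is assumed to be a shuffled-label baseline, which the abstract does not specify. The helper names `make_synthetic_activations` and `probe_f1` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def make_synthetic_activations(n_per_genre=200, n_genres=4, dim=64, seed=0):
    """Hypothetical stand-in for pooled LLM activations: one Gaussian cluster per genre."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=2.0, size=(n_genres, dim))
    X = np.concatenate([c + rng.normal(size=(n_per_genre, dim)) for c in centers])
    y = np.repeat(np.arange(n_genres), n_per_genre)
    return X, y


def probe_f1(X, y, seed=0):
    """Train a shallow classifier on activations and return macro F1 on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average="macro")


X, y = make_synthetic_activations()

# Probe: predict the genre label from the (synthetic) activation vectors.
genre_f1 = probe_f1(X, y)

# Assumed control task: same activations, labels randomly permuted, so any
# score above this baseline reflects genre information in the features.
shuffled_y = np.random.default_rng(1).permutation(y)
control_f1 = probe_f1(X, shuffled_y)

print(f"genre probe macro F1: {genre_f1:.2f}")
print(f"control macro F1:     {control_f1:.2f}")
```

With real data, `X` would be replaced by activation vectors extracted from the model for each text chunk, and `y` by the genre labels of the dataset. The comparison against the control score is what supports the claim that genre is recoverable from the activations rather than memorised by the probe.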
