CL · AI · CY · Feb 5, 2025

Sparse Autoencoders for Hypothesis Generation

arXiv:2502.04382v3 · 24 citations · h-index: 11 · ICML
Originality: Incremental advance
AI Analysis

This method addresses the need for efficient and interpretable hypothesis generation in text analysis, offering a scalable alternative to compute-intensive LLM-based approaches for researchers and practitioners.

The paper tackles the problem of generating interpretable hypotheses that link text data to a target variable. It introduces HypotheSAEs, a method that combines sparse autoencoders with LLM-generated natural language interpretations of the learned features. Against baselines, it reports at least +0.06 F1 on synthetic datasets and roughly twice as many significant findings on real datasets, while using 1-2 orders of magnitude less compute than recent LLM-based methods.

We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
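The three steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices here are assumptions. Step 1 trains a simple top-k sparse autoencoder by full-batch gradient descent. Step 2 ranks features by absolute Pearson correlation with the target, which is one simple selection rule; the paper's actual selection procedure may differ. Step 3 is reduced to building a prompt from a feature's top-activating texts, and no LLM is called. All function names (`train_topk_sae`, `encode`, `select_features`, `interpretation_prompt`) are hypothetical, and random vectors stand in for real text embeddings.

```python
import numpy as np


def encode(X, W_enc, b_enc, k):
    """Top-k SAE encoder: keep the k largest positive pre-activations per row."""
    pre = X @ W_enc + b_enc
    thresh = np.sort(pre, axis=1)[:, -k][:, None]
    mask = (pre >= thresh) & (pre > 0)
    return np.where(mask, pre, 0.0), mask


def train_topk_sae(X, n_features=64, k=4, lr=0.01, steps=300, seed=0):
    """Step 1 (sketch): fit a top-k sparse autoencoder to embeddings X (n x d)
    by full-batch gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_dec = rng.normal(size=(n_features, d))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary rows
    W_enc = W_dec.T.copy()                                  # tied initialisation
    b_enc = np.zeros(n_features)
    b_dec = X.mean(axis=0)
    losses = []
    for _ in range(steps):
        Z, mask = encode(X, W_enc, b_enc, k)
        err = Z @ W_dec + b_dec - X
        losses.append(float((err ** 2).sum(axis=1).mean()))
        # Gradients; the top-k mask is treated as constant (piecewise-linear encoder).
        g = 2.0 * err / n
        dpre = (g @ W_dec.T) * mask
        W_dec -= lr * (Z.T @ g)
        b_dec -= lr * g.sum(axis=0)
        W_enc -= lr * (X.T @ dpre)
        b_enc -= lr * dpre.sum(axis=0)
    return W_enc, b_enc, losses


def select_features(Z, y, n_select=3):
    """Step 2 (sketch): rank SAE features by |Pearson r| with the target y."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Zc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Features that never fire have zero variance; give them r = 0.
    r = np.divide(Zc.T @ yc, denom, out=np.zeros(Z.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(r))[:n_select], r


def interpretation_prompt(feature, Z, texts, n_examples=5):
    """Step 3 (sketch): build an LLM prompt from a feature's top-activating texts.
    Sending the prompt to an LLM is deliberately left out."""
    top = np.argsort(-Z[:, feature])[:n_examples]
    examples = "\n".join(f"- {texts[i]}" for i in top)
    return ("These texts strongly activate one feature. In a short phrase, "
            "describe what they have in common:\n" + examples)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 16))                   # stand-in for text embeddings
    y = X[:, 0] + 0.1 * rng.normal(size=500)         # toy target tied to one direction
    texts = [f"headline {i}" for i in range(500)]    # placeholder texts
    W_enc, b_enc, losses = train_topk_sae(X)
    Z, _ = encode(X, W_enc, b_enc, k=4)
    selected, r = select_features(Z, y)
    print(selected, r[selected])
    print(interpretation_prompt(selected[0], Z, texts))
```

In the method as described, the string returned by the last step would go to an LLM, and each resulting description (e.g. "mentions being surprised or shocked") becomes one candidate hypothesis about what predicts the target.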

Code implementations: 1 repo