CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
CE-Bench gives researchers working on AI interpretability a more reliable and efficient evaluation tool, though the contribution is incremental because it builds on prior benchmarks.
The paper tackles the problem of evaluating the interpretability of sparse autoencoders for large language models by introducing CE-Bench, a lightweight contrastive benchmark that achieves over 70% Spearman correlation with SAEBench results without needing an external LLM judge.
Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70% Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are open-sourced and publicly available.
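The agreement with SAEBench is reported as a Spearman correlation, which compares how two benchmarks rank the same set of sparse autoencoders rather than comparing raw scores. The sketch below is a minimal, standard-library illustration of that statistic, not the authors' evaluation code. The function names (`average_ranks`, `pearson`, `spearman`) and the score lists are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical interpretability scores for five SAEs (illustrative only).
candidate_scores = [0.62, 0.48, 0.71, 0.55, 0.80]  # e.g. a new benchmark
reference_scores = [0.58, 0.50, 0.66, 0.69, 0.77]  # e.g. an existing benchmark

rho = spearman(candidate_scores, reference_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because only ranks enter the computation, two benchmarks can correlate strongly even when their scores are on different scales, which is why a rank-based statistic suits a comparison between a judge-free benchmark and an LLM-judged one.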