LG, AI, CY | Sep 22, 2025

Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models

arXiv:2509.17665v1 | 2 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This paper addresses the understudied problem of religious bias in LLMs, a topic relevant to AI safety and fairness researchers. Its contribution is incremental: it applies existing interpretability methods to a new domain.

This paper investigated how large language models internally represent religious identity and its intersections with violence and geography using mechanistic interpretability and Sparse Autoencoders, finding that Islam was more frequently linked to violent language features while geographic associations reflected real-world demographics.

Despite growing research on bias in large language models (LLMs), most work has focused on gender and race, with little attention to religious identity. This paper explores how religion is internally represented in LLMs and how it intersects with concepts of violence and geography. Using mechanistic interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we analyze latent feature activations across five models. We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language. In contrast, geographic associations largely reflect real-world religious demographics, revealing how models embed both factual distributions and cultural stereotypes. These findings highlight the value of structural analysis in auditing not just outputs but also internal representations that shape model behavior.
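The overlap measurement mentioned in the abstract can be illustrated with a minimal sketch. It assumes each prompt group has already been reduced to a set of SAE feature IDs (for example, its top-activating features) and uses Jaccard similarity as the overlap score. The abstract does not specify the paper's exact metric, so Jaccard is an assumption here, and the group names and feature IDs below are made-up placeholders, not data from the paper. Retrieving features from Neuronpedia is deliberately left out.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard overlap |a & b| / |a | b|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_by_group(
    group_features: dict[str, set[int]], reference: set[int]
) -> dict[str, float]:
    """Score each group's feature set against a shared reference feature set."""
    return {name: jaccard(feats, reference) for name, feats in group_features.items()}


# Hypothetical feature IDs, for illustration only (not taken from the paper).
religion_features = {
    "religion_a": {101, 102, 103, 104},
    "religion_b": {201, 202, 103, 999},
}
violence_features = {103, 104, 500, 501}

scores = overlap_by_group(religion_features, violence_features)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A higher score means a larger share of a group's features is also active for the violence-related prompts. Comparing scores across groups, rather than reading any single score in isolation, is what would surface a disproportionate association.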
