CL | AI | Apr 7, 2025

Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs

arXiv:2504.04994v2 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

This work contributes to interpretability for AI safety by examining how values are represented inside LLMs. It is rated incremental because it covers a single domain (Chinese Social Values).

The paper studies how social values are encoded in large language models (LLMs). It proposes ValueExploration, a framework that locates the neurons associated with Chinese Social Values and deactivates them to observe how model behavior shifts.

Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.
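The neuron-level procedure described in the abstract (score neurons by activation difference, then deactivate the selected ones) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean-difference scoring rule, the top-k selection, and zeroing as the form of deactivation are all assumptions made for illustration, and the activation values are made-up numbers rather than outputs of a real model.

```python
from statistics import mean


def activation_difference(value_acts, neutral_acts):
    """Per-neuron mean activation on value prompts minus mean on neutral prompts.

    Both inputs are lists of rows, one row per prompt, one column per neuron.
    (Assumed scoring rule; the paper's exact criterion is not given here.)
    """
    n_neurons = len(value_acts[0])
    return [
        mean(row[j] for row in value_acts) - mean(row[j] for row in neutral_acts)
        for j in range(n_neurons)
    ]


def top_k_neurons(scores, k):
    """Indices of the k neurons with the largest absolute score, in index order."""
    ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return sorted(ranked[:k])


def deactivate(activations, neuron_ids):
    """Return a copy of one activation vector with the chosen neurons set to zero."""
    blocked = set(neuron_ids)
    return [0.0 if j in blocked else a for j, a in enumerate(activations)]


# Hypothetical toy data: 3 prompts x 4 neurons for each prompt group.
value_acts = [[0.9, 0.1, 0.5, 0.2], [1.1, 0.0, 0.5, 0.3], [1.0, 0.2, 0.5, 0.1]]
neutral_acts = [[0.1, 0.1, 0.5, 0.9], [0.0, 0.1, 0.5, 1.0], [0.2, 0.1, 0.5, 1.1]]

scores = activation_difference(value_acts, neutral_acts)
selected = top_k_neurons(scores, k=2)
ablated = deactivate(value_acts[0], selected)

print("selected neurons:", selected)
print("ablated activations:", ablated)
```

In a real model the activation rows would come from a chosen layer's hidden states, and the behavioral effect would be measured by re-running the benchmark with the selected neurons masked.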

