Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models
This work examines how LLMs represent trustworthiness and offers AI system designers insights for building more credible systems, though its contribution appears incremental because it analyzes existing models.
The researchers investigated whether large language models (LLMs) encode psychologically coherent representations of trustworthiness in web narratives. They found that models systematically distinguish high- from low-trust texts through layer- and head-level activation differences, and that the strongest associations are with human trust dimensions such as fairness and certainty.
Perceived trustworthiness underpins how users navigate online information, yet it remains unclear whether large language models (LLMs), increasingly embedded in search, recommendation, and conversational systems, represent this construct in psychologically coherent ways. We analyze how instruction-tuned LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web-like narratives using the PEACE-Reviews dataset annotated for cognitive appraisals, emotions, and behavioral intentions. Across models, systematic layer- and head-level activation differences distinguish high- from low-trust texts, revealing that trust cues are implicitly encoded during pretraining. Probing analyses show linearly decodable trust signals and fine-tuning effects that refine rather than restructure these representations. Strongest associations emerge with appraisals of fairness, certainty, and accountability-self, dimensions central to human trust formation online. These findings demonstrate that modern LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trustworthy AI systems in the web ecosystem. Code and appendix are available at: https://github.com/GerardYeo/TrustworthinessLLM.
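
To make the phrase "linearly decodable trust signals" concrete, the following is a minimal sketch of a linear probe, not the authors' implementation. It assumes hidden-state activations have already been extracted and pooled into one vector per text; here they are replaced by synthetic Gaussian vectors in which a single hypothetical "trust direction" separates the two classes. All function names, dimensions, and hyperparameters are illustrative assumptions.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=64, shift=2.0, seed=0):
    """Stand-in for pooled layer activations (ASSUMPTION: synthetic data).

    Labels are 1 for high-trust and 0 for low-trust texts. The two classes
    differ only by an offset along one random unit direction.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    # Move each sample by +shift or -shift along the hypothetical direction.
    offsets = shift * np.outer(2 * labels - 1, direction)
    acts = rng.normal(size=(n, dim)) + offsets
    return acts, labels


def fit_linear_probe(x, y, lr=0.1, steps=500, l2=1e-3):
    """Fit logistic regression by batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(high trust)
        grad_w = x.T @ (p - y) / len(y) + l2 * w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of samples whose predicted class matches the label."""
    preds = (x @ w + b > 0).astype(int)
    return float(np.mean(preds == y))


if __name__ == "__main__":
    acts, labels = make_synthetic_activations()
    split = 300  # first 300 samples train the probe, the rest evaluate it
    w, b = fit_linear_probe(acts[:split], labels[:split])
    print("held-out probe accuracy:", probe_accuracy(acts[split:], labels[split:], w, b))
```

In a real analysis, the synthetic array would be replaced by activations from a chosen layer of the model under study, and held-out accuracy well above chance would indicate that the trust label is linearly recoverable from that layer. Comparing against a probe trained on shuffled labels is a common sanity check.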