Towards Uncovering How Large Language Model Works: An Explainability Perspective
This work addresses the lack of transparency in LLMs, a critical problem for researchers and practitioners who aim to deploy these models safely and beneficially. Its contribution is incremental, since it reviews and synthesizes existing techniques.
This paper tackles the problem of understanding the opaque internal mechanisms of large language models (LLMs). It reviews and summarizes explainability techniques, such as mechanistic interpretability and probing, to uncover how knowledge is encoded and represented. The resulting framework offers insights that can enhance model performance, efficiency, and alignment.
Large language models (LLMs) have led to breakthroughs in language tasks, yet the internal mechanisms that enable their remarkable generalization and reasoning abilities remain opaque. This lack of transparency presents challenges such as hallucinations, toxicity, and misalignment with human values, hindering the safe and beneficial deployment of LLMs. This paper aims to uncover the mechanisms underlying LLM functionality through the lens of explainability. First, we review how knowledge is architecturally composed within LLMs and encoded in their internal parameters via mechanistic interpretability techniques. Then, we summarize how knowledge is embedded in LLM representations by leveraging probing techniques and representation engineering. Additionally, we investigate the training dynamics from a mechanistic perspective to explain phenomena such as grokking and memorization. Lastly, we explore how the insights gained from these explanations can enhance LLM performance through model editing, improve efficiency through pruning, and better align LLMs with human values.