CL · AI · CV · LG · Feb 7, 2025

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

arXiv:2502.05242v2 · 2 citations · h-index: 10
Originality: Highly original
AI Analysis

This work addresses the problem of monitoring large language models, a concern for both developers and users of AI systems, and offers an incremental solution that improves transparency and trustworthiness.

The authors tackle the problem of enhancing the transparency of large language models. Their proposed method, TELLME, is applied to trustworthiness tasks such as safety risk monitoring and detoxification, where it yields consistent improvements in both transparency and task performance.

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making process remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to monitor LLMs, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to monitor their latent thinking. However, previous methods only try to develop external monitors instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the applications of TELLME on trustworthiness tasks (e.g., safety risks monitoring tasks and detoxification tasks), where LLMs achieve consistent improvement in transparency and task performance. More crucially, we theoretically analyze the improvement of TELLME on LLMs' generalization ability through optimal transport theory.

