CL · Nov 27, 2025

Token-Level Marginalization for Multi-Label LLM Classifiers

arXiv:2511.22312v1 · 4.91 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for better confidence assessment in content moderation systems. The advance is incremental because it builds on existing models such as LLaMA Guard.

The paper tackles the problem of deriving interpretable confidence scores from generative LLMs for multi-label content safety classification, and reports that token-level probability estimation significantly improves interpretability and reliability.

This paper addresses the critical challenge of deriving interpretable confidence scores from generative large language models (LLMs) applied to multi-label content safety classification. While models like LLaMA Guard are effective at identifying unsafe content and its categories, their generative architecture does not directly expose class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and to evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, the paper demonstrates that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.
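
To make the idea of token-level probability estimation concrete, here is a minimal sketch of one generic way to turn next-token logits into class-level scores. It is not taken from the paper, and the abstract does not describe its three approaches, so everything here is an assumption: the logit values, the token strings, the `label_tokens` mapping, and the helper names `softmax` and `label_probabilities` are all hypothetical. The sketch sums the probability mass of every token variant that spells a label (marginalization over surface forms) and then renormalizes over the label set.

```python
import math


def softmax(logits):
    """Convert a {token: logit} mapping into {token: probability}."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def label_probabilities(logits, label_tokens):
    """Marginalize token probabilities into label probabilities.

    logits:       {token: logit} at the position where the model emits its verdict
    label_tokens: {label: [token variants that express that label]}
    """
    probs = softmax(logits)
    # Sum the mass of all surface forms that map to the same label.
    mass = {
        label: sum(probs.get(tok, 0.0) for tok in variants)
        for label, variants in label_tokens.items()
    }
    # Renormalize so the scores over the label set sum to 1.
    total = sum(mass.values())
    return {label: m / total for label, m in mass.items()}


# Hypothetical logits at the decision position (illustrative values only).
logits = {"safe": 2.0, " safe": 1.0, "unsafe": 0.5, " unsafe": 0.0, "the": -1.0}
label_tokens = {
    "safe": ["safe", " safe"],
    "unsafe": ["unsafe", " unsafe"],
}

scores = label_probabilities(logits, label_tokens)
flagged = scores["unsafe"] >= 0.5  # an example moderation threshold, chosen arbitrarily
```

A score of this kind is what makes an adjustable threshold possible: instead of accepting the generated string as a hard decision, a moderation pipeline could compare `scores["unsafe"]` against a cutoff tuned per deployment. For the multi-label case, the same marginalization could be repeated at each position where a category token is emitted, again as an assumption rather than the paper's stated procedure.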
