CL, AI, Jan 26

BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation

arXiv:2601.18253v1
Originality: Highly original
AI Analysis

This work addresses the challenge of reliable satisfaction evaluation for open-ended AI assistants, offering a scalable solution for industrial applications.

The paper addresses user-satisfaction evaluation in conversational AI. It introduces BoRP, a scalable framework that uses bootstrapped regression probing to map LLM hidden states to continuous satisfaction scores. The authors report closer alignment with human judgments than generative baselines, at inference costs lower by orders of magnitude.
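To make the idea of mapping hidden states to scores concrete, here is a minimal sketch of a regression probe fitted with single-target Partial Least Squares (the paper's abstract names PLS as the mapping). It is not the authors' implementation. The hidden states are synthetic random vectors with one planted "satisfaction" direction, standing in for real LLM activations, and the names `fit_pls1`, `predict_pls1`, `hidden`, and `scores` are hypothetical. The bootstrapping and rubric-generation steps are not modeled.

```python
import numpy as np

def fit_pls1(X, y, n_components):
    """Fit single-target PLS with the NIPALS recursion.

    Returns (coef, x_mean, y_mean) so that scores are predicted as
    (X - x_mean) @ coef + y_mean.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean           # centered, deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                          # direction of max covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                             # latent score of each sample
        tt = t @ t
        p = Xr.T @ t / tt                      # X loading
        c = yr @ t / tt                        # y loading
        Xr = Xr - np.outer(t, p)               # deflate X
        yr = yr - c * t                        # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression vector in input space
    return coef, x_mean, y_mean

def predict_pls1(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

# Synthetic stand-in for LLM hidden states (assumption: one linear direction
# in the latent space carries the satisfaction signal).
rng = np.random.default_rng(0)
n, d = 300, 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
latent = rng.normal(size=n)
hidden = rng.normal(size=(n, d)) + 3.0 * np.outer(latent, direction)
scores = 3.0 + latent + 0.2 * rng.normal(size=n)   # continuous "human" scores

model = fit_pls1(hidden[:200], scores[:200], n_components=2)
pred = predict_pls1(model, hidden[200:])
corr = np.corrcoef(pred, scores[200:])[0, 1]        # held-out agreement
print(f"held-out correlation: {corr:.3f}")
```

A probe of this kind needs only one forward pass to obtain the hidden state plus a dot product, which is one plausible reading of why such an approach could be much cheaper than generating a judgment token by token.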

Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and utilizes Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
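The abstract's mention of CUPED can be illustrated with a short sketch. CUPED (Controlled-experiment Using Pre-Experiment Data) subtracts from each user's experiment metric the part that is predictable from a pre-experiment covariate, which lowers variance without shifting the mean. The example below assumes, purely for illustration, that the covariate is a per-user satisfaction score from before the experiment. The data are simulated, and the names `cuped_adjust`, `pre`, and `post` are hypothetical rather than taken from the paper.

```python
import random
import statistics

def cuped_adjust(y, x):
    """Return CUPED-adjusted values y_i - theta * (x_i - mean(x)),
    with theta = cov(y, x) / var(x)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

# Simulated A/B test: alternate users between control and treatment.
rng = random.Random(0)
n = 2000
effect = 0.05                                      # small true treatment effect
pre = [rng.gauss(3.0, 1.0) for _ in range(n)]      # pre-experiment score per user
treated = [i % 2 == 1 for i in range(n)]
post = [0.8 * p + rng.gauss(0.0, 0.5) + (effect if t else 0.0)
        for p, t in zip(pre, treated)]

adjusted = cuped_adjust(post, pre)
var_raw = statistics.variance(post)
var_cuped = statistics.variance(adjusted)
print(f"variance before: {var_raw:.3f}, after CUPED: {var_cuped:.3f}")
```

Because the adjustment term has zero mean, the overall average is unchanged while the variance drops, so a given effect size can be detected with fewer users. The size of the reduction depends on how strongly the covariate correlates with the experiment metric.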

