CL · Mar 11

End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

arXiv:2603.10570v122.9 · h-index: 9
Predicted impact: top 88% in CL · last 90 days
Originality: Incremental advance
AI Analysis

The paper offers a scalable way to evaluate chatbots with minimal manual effort, though the advance is incremental because it builds on existing LLM and retrieval-augmented generation methods.

The paper tackles the problem of unreliable evaluation for domain-specific chatbots by proposing an end-to-end automatic evaluator that generates Q&A pairs from a knowledge base and uses LLMs with confidence filtering to judge responses. Applied to a Vietnamese news dataset, it achieves high agreement with human judgments while significantly reducing review overhead.

Large language models (LLMs) combined with retrieval-augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.
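
As a rough illustration of the confidence-based filtering step described in the abstract, the sketch below routes low-confidence judge verdicts to human review and accepts the rest automatically. It is not the paper's implementation: the `Judgment` record, the `split_by_confidence` helper, the 0.8 threshold, and the sample values are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judge verdict on a chatbot answer (hypothetical structure)."""
    question: str
    verdict: str       # e.g. "correct" or "incorrect", as returned by a judge
    confidence: float  # judge confidence in [0, 1]; how it is obtained is assumed


def split_by_confidence(judgments, threshold=0.8):
    """Split verdicts into (auto_accepted, needs_review).

    The threshold is an assumed value, not one taken from the paper.
    """
    auto_accepted, needs_review = [], []
    for j in judgments:
        if j.confidence >= threshold:
            auto_accepted.append(j)
        else:
            needs_review.append(j)
    return auto_accepted, needs_review


# Made-up sample verdicts, for illustration only.
judgments = [
    Judgment("q1", "correct", 0.95),
    Judgment("q2", "incorrect", 0.55),
    Judgment("q3", "correct", 0.80),
]
auto_accepted, needs_review = split_by_confidence(judgments)
```

Under these assumptions only the uncertain cases reach a human reviewer, which is how a filter of this kind could lower review overhead.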

