EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
This work addresses financial transparency by providing a benchmark and method for detecting evasive answers in earnings calls, representing a novel method for a known bottleneck.
The paper tackled the problem of detecting evasive answers in financial Q&A by introducing EvasionBench, a large-scale benchmark with 30,000 training and 1,000 test samples, and developed a multi-model annotation framework that improved accuracy by 2.4 percent, resulting in a model achieving 81.3 percent accuracy.
Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393) - evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.