LG · CL · Jan 14

EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge

arXiv:2601.09142v1
Originality: Highly original
AI Analysis

This work addresses financial transparency by providing a benchmark and a method for detecting evasive answers in earnings calls, offering a novel approach to a known bottleneck: the lack of large-scale benchmarks.

The paper tackles the detection of evasive answers in financial Q&A. It introduces EvasionBench, a large-scale benchmark with 30,000 training samples and 1,000 human-annotated test samples, and a multi-model annotation framework that outperforms single-model distillation by 2.4 percent. The resulting model, Eva-4B, achieves 81.3 percent accuracy.

Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393) - evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.
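
The disagreement-mining step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the label names (`direct`, `intermediate`, `evasive`), the function `mine_disagreements`, and the toy keyword-based annotators and judge are all assumptions made for the example. In practice the two annotators and the judge would be calls to LLMs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical names for the three evasion levels; the abstract does not name them.
LEVELS = ["direct", "intermediate", "evasive"]

Annotator = Callable[[str, str], str]        # (question, answer) -> label
Judge = Callable[[str, str, str, str], str]  # (question, answer, label_a, label_b) -> label


@dataclass
class LabeledSample:
    question: str
    answer: str
    label: str
    source: str  # "consensus" if both annotators agreed, "judge" otherwise


def mine_disagreements(
    pairs: Iterable[Tuple[str, str]],
    annotator_a: Annotator,
    annotator_b: Annotator,
    judge: Judge,
) -> List[LabeledSample]:
    """Label each Q&A pair with two annotators; send conflicts to a judge."""
    out: List[LabeledSample] = []
    for question, answer in pairs:
        label_a = annotator_a(question, answer)
        label_b = annotator_b(question, answer)
        if label_a == label_b:
            out.append(LabeledSample(question, answer, label_a, "consensus"))
        else:
            # Boundary case: the annotators conflict, so the judge resolves the label.
            resolved = judge(question, answer, label_a, label_b)
            out.append(LabeledSample(question, answer, resolved, "judge"))
    return out


# Toy stand-ins for the LLM annotators and judge (assumptions for this demo only).
def toy_annotator_a(question: str, answer: str) -> str:
    return "evasive" if "cannot comment" in answer.lower() else "direct"


def toy_annotator_b(question: str, answer: str) -> str:
    text = answer.lower()
    if "cannot comment" in text:
        return "evasive"
    if "roughly" in text:
        return "intermediate"
    return "direct"


def toy_judge(question: str, answer: str, label_a: str, label_b: str) -> str:
    # Arbitrary demo rule: pick the more evasive of the two proposed labels.
    return max(label_a, label_b, key=LEVELS.index)


pairs = [
    ("What is Q3 guidance?", "We expect revenue of $2B."),
    ("What is the margin outlook?", "Roughly in line with prior trends."),
    ("Any layoffs planned?", "We cannot comment on that."),
]
samples = mine_disagreements(pairs, toy_annotator_a, toy_annotator_b, toy_judge)
hard_examples = [s for s in samples if s.source == "judge"]
```

Keeping the `source` field makes it possible to separate judge-resolved samples from consensus samples later, for instance to compare training loss on the two subsets as the abstract does.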

