Nathan Miller

12.2 · AI · Jun 1

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Shannon Serrao, Soumitra Chatterjee, Dorina Strori et al.

Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evaluation approaches fundamentally different from academic benchmarks. Spider and BIRD established execution-accuracy protocols; G-Eval and RAGAS advanced LLM-based assessment; and recent work such as Spider 2.0, BEAVER, and BIRD-Interact has begun to address enterprise and agentic dimensions. However, no single framework unifies text-to-SQL assessment with agentic behavior evaluation in a production-grade pipeline calibrated against human expert judgment. We present BADGER, a unified evaluation framework developed at Merkle that integrates text-to-SQL assessment with agentic behavior evaluation. BADGER makes three contributions. First, LLM-assisted SQL component extraction extends the Spider methodology to handle CTE-heavy, dialect-specific SQL. Second, a hybrid execution accuracy metric (Hybrid-EX) resolves column-aliasing and numeric-tolerance brittleness by using an LLM to infer structural alignments before deterministic cell-level scoring. Validated on 150 human-annotated industry queries, Hybrid-EX achieves Cohen's kappa = 0.717 [95% CI: 0.600-0.822] (substantial agreement) and 87.3% balanced accuracy, outperforming all six competing frameworks (Delta-kappa: 0.322-0.502, all p <= 0.001). Third, an enterprise agentic evaluation suite assembles RAGAS, G-Eval, and agent benchmark metrics into a unified pipeline; Excess Tool Usage is its sole novel element. BADGER runs entirely within the client's governed data environment, supports configurable LLM judge backends, and enables rapid prototyping of client-specific judges and metrics, so it serves as a continuous evaluation backbone rather than a one-time quality gate.

LG · Feb 3, 2023

Towards a responsible machine learning approach to identify forced labor in fisheries

Rocío Joo, Gavin McDonald, Nathan Miller et al.

Many fishing vessels use forced labor, but identifying vessels that engage in this practice is challenging because few are regularly inspected. We developed a positive-unlabeled learning algorithm using vessel characteristics and movement patterns to estimate an upper bound of the number of positive cases of forced labor, with the goal of helping make accurate, responsible, and fair decisions. 89% of the reported cases of forced labor were correctly classified as positive (recall) while 98% of the vessels certified as having decent working conditions were correctly classified as negative. The recall was high for vessels from different regions using different gears, except for trawlers. We found that as much as ~28% of vessels may operate using forced labor, with the fraction much higher in squid jiggers and longlines. This model could inform risk-based port inspections as part of a broader monitoring, control, and surveillance regime to reduce forced labor. * Translated versions of the English title and abstract are available in five languages in S1 Text: Spanish, French, Simplified Chinese, Traditional Chinese, and Indonesian.

2 Papers