CL AI · May 29

Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang, Siyu Zhang, Tingting Dan

arXiv:2605.30826 · 88.0 · h-index: 10

Predicted impact: top 40% in CL · last 90 days · Originality: Incremental advance

AI Analysis

This work provides a method for building a higher-yield review queue for human curators of biomedical named entity recognition (NER) data: it scores LLM-suggested entity candidates so that likely-incorrect suggestions can be filtered out before review.

This paper addresses the challenge of identifying correct biomedical entities in LLM outputs, where agreement among LLMs does not guarantee correctness because correctness depends on each corpus's annotation conventions. The authors introduce BioConCal, a supervised scorer that uses agreement, mention, surface-availability, and document features to verify panel-surfaced biomedical entity candidates. BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target, it selects 1,340 candidates with an empirical test precision of 0.939, compared with 293 candidates for raw agreement.
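To make the raw-agreement baseline concrete, the following is a minimal sketch of scoring each candidate by the fraction of panel models that surfaced it and then measuring how well that score ranks correct candidates above incorrect ones. The function names, the 0/1 vote encoding, and the toy data are assumptions for illustration; they are not taken from the paper, which may define agreement differently.

```python
from typing import Sequence


def agreement_score(votes: Sequence[int]) -> float:
    """Fraction of panel models that surfaced a candidate (votes are 0 or 1)."""
    return sum(votes) / len(votes)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four candidates voted on by an eight-model panel.
panel_votes = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # 8/8 agree, correct under the corpus convention
    [1, 1, 1, 1, 1, 1, 0, 0],  # 6/8 agree, yet incorrect under the convention
    [1, 1, 1, 0, 0, 0, 0, 0],  # 3/8 agree, correct
    [1, 0, 0, 0, 0, 0, 0, 0],  # 1/8 agree, incorrect
]
labels = [1, 0, 1, 0]

scores = [agreement_score(v) for v in panel_votes]
print(auroc(scores, labels))  # 0.75
```

In this toy example the second candidate has high agreement but is wrong under the corpus convention, which is the kind of case where agreement alone misranks candidates and a scorer with additional features could help.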

Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but whether a mention counts as correct for a given corpus depends on that corpus's annotation conventions, span boundaries, entity granularity, and type schema. Multi-LLM agreement is a salience signal, not a measure of corpus-convention correctness. We introduce a candidate-level benchmark for verifying panel-surfaced candidates, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns the predictions of eight LLMs over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this verification layer for a fixed candidate stream, using agreement, mention, surface-availability, and document features that require no gold annotations at inference time. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target, it selects 1,340 candidates at an empirical test precision of 0.939, compared with 293 for raw agreement. This corresponds to a candidate-level recall of 0.592 and a corpus-level recall of 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.
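The phrase "validation-selected 0.95 precision target" can be illustrated with a minimal sketch: pick the lowest score threshold whose precision on a validation split meets the target, then report how many test candidates clear that threshold and at what empirical precision. The function names and the toy scores below are hypothetical, and the paper's actual selection procedure may differ in its details.

```python
from typing import Optional, Sequence, Tuple


def select_threshold(
    val_scores: Sequence[float],
    val_labels: Sequence[int],
    target_precision: float = 0.95,
) -> Optional[float]:
    """Return the lowest threshold whose validation precision meets the
    target, or None if no threshold does."""
    ranked = sorted(zip(val_scores, val_labels), key=lambda pair: -pair[0])
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Evaluate only at boundaries between distinct scores, so that
        # tied candidates are always selected or rejected together.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_positives / i >= target_precision:
            best = score
    return best


def apply_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[int, float]:
    """Return (number selected, empirical precision) at a fixed threshold."""
    selected = [y for s, y in zip(scores, labels) if s >= threshold]
    count = len(selected)
    precision = sum(selected) / count if count else 0.0
    return count, precision


# Hypothetical toy splits; the values are made up for illustration.
val_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
val_labels = [1, 1, 1, 0, 1]
test_scores = [0.95, 0.75, 0.72, 0.4]
test_labels = [1, 0, 1, 1]

threshold = select_threshold(val_scores, val_labels, target_precision=0.95)
count, precision = apply_threshold(test_scores, test_labels, threshold)
print(threshold, count, round(precision, 3))  # 0.7 3 0.667
```

Because the threshold is fixed on validation data, the empirical test precision can land below the target, which matches the reported 0.939 against a 0.95 target. As a separate arithmetic check, 0.592 × 0.883 ≈ 0.523, so the reported corpus-level recall is consistent with candidate-level recall scaled by the within-panel ceiling, although the abstract does not state that this is how it was computed.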
