Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

arXiv:2605.30529 · 17.0 · h-index: 48


AI Analysis

This work provides an open recipe for building domain-specific medical retrievers from LLM-generated data, which is significant for improving clinical coding search accuracy in non-English languages for healthcare professionals and systems.

This paper addresses the loss of recall when clinical retrieval models are applied to non-English languages, specifically when retrieving ICD-10-CM / CIE-10 codes. The authors build a two-stage retriever (bi-encoder followed by a cross-encoder reranker) from a Spanish biomedical encoder fine-tuned on Gemini-generated synthetic data covering six languages. The bi-encoder alone reaches an MRR of 0.876 and an R@5 of 0.804, matching or exceeding BioBERT-ST (0.866 and 0.790). Adding the cross-encoder reranker raises aggregate R@5 to 0.822 and outperforms BioBERT-ST in several non-English languages (e.g., Portuguese R@5 of 0.829 vs. 0.714), at the cost of a small regression on English.
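For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how MRR and R@k are commonly computed. It assumes each query may have several relevant codes and that R@k is the fraction of those codes found in the top k results; the paper's exact definitions are not given on this page, so the function names, the toy rankings, and the relevance sets are illustrative assumptions only.

```python
from typing import List, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant code, or 0.0 if none is retrieved."""
    for position, code in enumerate(ranked, start=1):
        if code in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: List[Sequence[str]],
                         relevants: List[Set[str]]) -> float:
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, g) for r, g in zip(rankings, relevants)]
    return sum(scores) / len(scores)


def recall_at_k(rankings: List[Sequence[str]],
                relevants: List[Set[str]], k: int) -> float:
    """Average, over queries, the share of relevant codes found in the top k."""
    scores = [len(set(r[:k]) & g) / len(g) for r, g in zip(rankings, relevants)]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries, each with a ranked list of codes.
toy_rankings = [["E11.9", "E10.9", "I10"], ["J45.909", "I10", "J44.9"]]
toy_relevant = [{"E10.9"}, {"J45.909", "J44.9"}]

print(mean_reciprocal_rank(toy_rankings, toy_relevant))  # 0.75
print(recall_at_k(toy_rankings, toy_relevant, k=2))      # 0.75
```

Under this multi-label reading, R@k can sit below MRR for the same system, because MRR only rewards the first relevant hit while R@k requires all relevant codes to appear in the top k.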

Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.
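The two-stage design described in the abstract (a bi-encoder shortlist followed by cross-encoder reranking) can be sketched as follows. This is not the authors' implementation: `toy_embed` and `toy_pair_score` are hypothetical stand-ins for the fine-tuned neural models, using bag-of-words cosine similarity and token overlap so that the example runs without any model download. The corpus is a handful of ICD-10-CM descriptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for a bi-encoder: bag-of-words counts (not a neural model)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(token, 0.0) for token, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: Jaccard overlap of the token sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(query: str,
                         corpus: Dict[str, str],
                         embed: Callable[[str], Vector],
                         pair_score: Callable[[str, str], float],
                         first_stage_k: int = 3,
                         final_k: int = 2) -> List[str]:
    """Return the top `final_k` codes after shortlist and rerank."""
    query_vec = embed(query)
    # Stage 1: query and documents are embedded independently, then compared.
    shortlist = sorted(corpus,
                       key=lambda code: cosine(query_vec, embed(corpus[code])),
                       reverse=True)[:first_stage_k]
    # Stage 2: each (query, document) pair is scored jointly on the shortlist.
    reranked = sorted(shortlist,
                      key=lambda code: pair_score(query, corpus[code]),
                      reverse=True)
    return reranked[:final_k]


# Illustrative corpus of code descriptions (assumed, not from the paper's data).
toy_corpus = {
    "E11.9": "type 2 diabetes mellitus without complications",
    "E10.9": "type 1 diabetes mellitus without complications",
    "I10": "essential primary hypertension",
    "J45.909": "unspecified asthma uncomplicated",
}

result = retrieve_then_rerank("diabetes mellitus type 2", toy_corpus,
                              toy_embed, toy_pair_score)
print(result)  # ['E11.9', 'E10.9']
```

The split matters for cost: the first stage can precompute document embeddings and scan the whole code set cheaply, while the more expensive pairwise scorer only sees the shortlist. In a real system both scoring functions would be replaced by the trained encoder and reranker.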
