LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking
For biomedical NLP practitioners, LongBEL targets inconsistent entity linking within a document. Its gains come mainly from document-level consistency on recurring concepts rather than from better disambiguation of isolated mentions, so it is best read as an incremental improvement.
LongBEL introduces a document-level generative framework for biomedical entity linking that uses full-document context and a memory of previous predictions. It improves over sentence-level generative baselines on five benchmarks across three languages (English, French, and Spanish), with the largest gains on recurring concepts.
Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface forms. We introduce LongBEL, a document-level generative framework that combines full-document context with a memory of previous predictions. To make this memory robust, LongBEL is trained with cross-validated predictions rather than gold labels, reducing the mismatch between training and inference and limiting cascading errors. Experiments on five biomedical benchmarks across English, French, and Spanish show that LongBEL improves over sentence-level generative baselines, with the largest gains on datasets where concepts frequently recur within documents. An ensemble of local, global, and memory-based variants achieves the best results across all benchmarks. Further analysis shows that the largest gains occur on recurring concepts, suggesting that LongBEL mainly improves document-level consistency rather than isolated mention disambiguation.
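The idea of training the memory on cross-validated predictions rather than gold labels can be illustrated with a minimal sketch. It is not LongBEL's implementation: the data shapes, the function names (`make_folds`, `cross_validated_predictions`, `memory_examples`, `train_lookup_linker`), the toy dictionary linker that stands in for the generative model, and the placeholder concept IDs are all assumptions made for illustration. Each document is predicted by a linker trained only on the other folds, so the memory seen during training contains the kind of errors the model will meet at inference.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical data shapes, chosen only for this illustration:
# a document is (doc_id, [(mention_text, gold_concept_id), ...]).
Mention = Tuple[str, str]
Document = Tuple[str, List[Mention]]
Linker = Callable[[str], str]  # maps a mention string to a concept ID


def make_folds(docs: Sequence[Document], k: int) -> List[List[Document]]:
    """Split documents into k folds by round-robin assignment."""
    return [list(docs[i::k]) for i in range(k)]


def cross_validated_predictions(
    docs: Sequence[Document],
    k: int,
    train_fn: Callable[[List[Document]], Linker],
) -> Dict[str, List[str]]:
    """Predict every document with a linker that never saw it in training."""
    folds = make_folds(docs, k)
    predictions: Dict[str, List[str]] = {}
    for held_out_index, held_out in enumerate(folds):
        train_docs = [
            doc
            for index, fold in enumerate(folds)
            if index != held_out_index
            for doc in fold
        ]
        linker = train_fn(train_docs)
        for doc_id, mentions in held_out:
            predictions[doc_id] = [linker(text) for text, _ in mentions]
    return predictions


def memory_examples(doc: Document, predicted: List[str]):
    """Build (memory, mention, gold) training triples for one document.

    The memory for mention i pairs the earlier mentions with their
    PREDICTED concepts, which may be wrong, instead of their gold labels.
    """
    _, mentions = doc
    examples = []
    for i, (text, gold) in enumerate(mentions):
        earlier_texts = [m for m, _ in mentions[:i]]
        memory = list(zip(earlier_texts, predicted[:i]))
        examples.append((memory, text, gold))
    return examples


def train_lookup_linker(train_docs: List[Document]) -> Linker:
    """Toy stand-in for a trained model: exact-match dictionary lookup."""
    table: Dict[str, str] = {}
    for _, mentions in train_docs:
        for text, concept in mentions:
            table[text.lower()] = concept
    return lambda text: table.get(text.lower(), "NIL")


if __name__ == "__main__":
    # Placeholder concept IDs, not real knowledge-base identifiers.
    docs = [
        ("d1", [("heart attack", "C_MI"), ("MI", "C_MI")]),
        ("d2", [("MI", "C_MI"), ("aspirin", "C_ASPIRIN")]),
    ]
    preds = cross_validated_predictions(docs, k=2, train_fn=train_lookup_linker)
    for doc in docs:
        for example in memory_examples(doc, preds[doc[0]]):
            print(doc[0], example)
```

In this toy run, the linker that predicts `d1` has only seen `d2`, so it fails on "heart attack" and the memory for the following mention "MI" carries that wrong entry. Training on such imperfect memories, rather than on gold ones, is what reduces the mismatch between training and inference described above.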