CL · Mar 26, 2025

Low-resource Information Extraction with the European Clinical Case Corpus

Soumitra Ghosh, Begona Altuna, Saeed Farzi, Pietro Ferrazzi, Alberto Lavelli, Giulia Mezzanotte, Manuela Speranza, Bernardo Magnini

arXiv:2503.20568

Originality: Synthesis-oriented

AI Analysis

This paper addresses data scarcity for clinical information extraction across multiple languages. The contribution is incremental: it builds on existing methods and pairs them with a new dataset.

The authors tackle low-resource information extraction in the medical domain by creating E3C-3.0, a multilingual dataset of clinical cases annotated with diseases and test-result relations. They show that fine-tuning state-of-the-art LLMs on it improves performance, and that transfer learning across languages is effective at mitigating data scarcity.

We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, including automatic annotation projection based on Large Language Models (LLMs) and human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning in different languages is very effective, mitigating the scarcity of data. Finally, we compare performance both on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89 .
