CL Sep 19, 2023
FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
Jamil Zaghir, Mina Bjelogrlic, Jean-Philippe Goldman et al.
Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs), as larger annotated datasets are still needed. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language-agnostic BERT-based approach, it offers an efficient way to expand low-resource corpora with little human effort, using only open data resources that are already available. Quantitative and qualitative evaluations of the quality and effectiveness of semi-automatic data generation strategies are often lacking. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of the French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French NLP applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.
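To illustrate what annotation projection involves, here is a minimal sketch. It assumes that token-level alignments between a source sentence and its translation are already available; in the paper these come from a language-agnostic BERT-based approach, whereas here the alignment, the sentences, and the helper name `project_span` are hypothetical and hard-coded for illustration only.

```python
from typing import Dict, List, Optional, Tuple

def project_span(
    span: Tuple[int, int],
    alignment: Dict[int, List[int]],
) -> Optional[Tuple[int, int]]:
    """Project an inclusive source token span onto the target sentence.

    Returns the smallest contiguous target span covering every target token
    aligned to a source token in the span, or None if nothing is aligned.
    """
    start, end = span
    targets = [t for s in range(start, end + 1) for t in alignment.get(s, [])]
    if not targets:
        return None
    return min(targets), max(targets)

# Hypothetical Spanish source sentence and its French translation.
source_tokens = ["El", "paciente", "presenta", "dolor", "abdominal", "agudo"]
target_tokens = ["Le", "patient", "présente", "une", "douleur", "abdominale", "aiguë"]

# Hypothetical word alignment: source token index -> target token indices.
alignment = {0: [0], 1: [1], 2: [2], 3: [4], 4: [5], 5: [6]}

# Source annotation covering "dolor abdominal agudo" (tokens 3 to 5).
projected = project_span((3, 5), alignment)
projected_text = " ".join(target_tokens[projected[0]:projected[1] + 1])
```

Taking the covering span is only one possible design choice; it keeps projected entities contiguous but can over-extend when alignments are noisy, which is one reason the quality of such projections needs evaluating.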
72.6 CY Apr 23
Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content
Lydie Bednarczyk, Jamil Zaghir, Julien Ehrsam et al.
Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries. Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through a literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from the Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at the failure mode, severity, and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale and structured feedback. Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence. Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured and reproducible method for identifying clinically relevant risks caused by these summaries.
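In classical FMEA/FMECA practice, the three dimensions are often combined into a risk priority number (RPN), the product of occurrence, severity, and detectability. The abstract does not say how the authors combine their scores, so the following is only a sketch under that assumption, using the 5-point scales mentioned above; the class name, failure mode names, and ratings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureModeRating:
    """One reviewer's rating of a failure mode on 5-point ordinal scales."""
    name: str
    occurrence: int
    severity: int
    detectability: int

    def __post_init__(self) -> None:
        # Each dimension is assumed to be scored from 1 to 5.
        for field_name in ("occurrence", "severity", "detectability"):
            value = getattr(self, field_name)
            if not 1 <= value <= 5:
                raise ValueError(f"{field_name} must be in 1..5, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: classical product rule (an assumption here)."""
        return self.occurrence * self.severity * self.detectability

# Hypothetical ratings, for illustration only.
ratings = [
    FailureModeRating("formatting issue", occurrence=4, severity=1, detectability=1),
    FailureModeRating("omission", occurrence=3, severity=4, detectability=4),
    FailureModeRating("fabricated detail", occurrence=2, severity=5, detectability=3),
]

# Rank failure modes from highest to lowest risk priority.
ranked = sorted(ratings, key=lambda r: r.rpn, reverse=True)
ranked_names = [r.name for r in ranked]
```

With a product rule, a frequent but harmless and easily detected failure ranks below a rarer one that is severe and hard to detect, which is the behaviour a prospective risk assessment usually wants.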
CV Feb 26, 2025
Tell me why: Visual foundation models as self-explainable classifiers
Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo et al.
Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this context, self-explainable models (SEMs) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.
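The idea of decomposing a prediction into a weighted sum of concepts can be sketched generically as follows. This is not the ProtoFM architecture; it only assumes that each class logit is a linear combination of concept activations, and the function names, activations, and weights are hypothetical.

```python
from typing import List

def concept_contributions(
    activations: List[float],
    weights: List[List[float]],
) -> List[List[float]]:
    """Return contributions[c][k] = weights[c][k] * activations[k]."""
    return [[w * a for w, a in zip(row, activations)] for row in weights]

def class_logits(contributions: List[List[float]]) -> List[float]:
    """Each class logit is the sum of its per-concept contributions."""
    return [sum(row) for row in contributions]

# Hypothetical activations of three concepts for one image.
activations = [0.9, 0.1, 0.5]
# Hypothetical weights: one row per class, one column per concept.
weights = [
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
]

contributions = concept_contributions(activations, weights)
logits = class_logits(contributions)

# The explanation for the predicted class is its list of contributions;
# the concept with the largest contribution is the main piece of evidence.
predicted = max(range(len(logits)), key=lambda c: logits[c])
top_concept = max(
    range(len(activations)), key=lambda k: contributions[predicted][k]
)
```

Because the logit is exactly the sum of the contributions, the explanation accounts for the whole prediction by construction; whether the concepts themselves are meaningful is a separate question that interpretability metrics try to measure.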
CV Jun 14, 2024
ProtoS-ViT: Visual foundation models for sparse self-explainable classifications
Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo et al.
Prototypical networks aim to build intrinsically explainable models based on the linear summation of concepts. Concepts are coherent entities that we, as humans, can recognize and associate with a certain object or entity. However, important challenges remain in the fair evaluation of the explanation quality provided by these models. This work first proposes an extensive set of quantitative and qualitative metrics that make it possible to identify drawbacks in current prototypical networks. It then introduces a novel architecture which provides compact explanations, outperforming current prototypical models in terms of explanation quality. Overall, the proposed architecture demonstrates how frozen pre-trained ViT backbones can be effectively turned into prototypical models for both general and domain-specific tasks, in our case biomedical image classifiers. Code is available at https://github.com/hturbe/protosvit.
CL May 2, 2024
Prompt engineering paradigms for medical applications: scoping review and recommendations for better practices
Jamil Zaghir, Marco Naguib, Mina Bjelogrlic et al.
Prompt engineering is crucial for harnessing the potential of large language models (LLMs), especially in the medical domain where specialized terminology and phrasing are used. However, the efficacy of prompt engineering in the medical domain remains to be explored. In this work, 114 recent studies (2022-2024) applying prompt engineering in medicine, covering prompt learning (PL), prompt tuning (PT), and prompt design (PD), are reviewed. PD is the most prevalent (78 articles). In 12 papers, the terms PD, PL, and PT were used interchangeably. ChatGPT is the most commonly used LLM, with seven papers using it for processing sensitive clinical data. Chain-of-Thought emerges as the most common prompt engineering technique. While PL and PT articles typically provide a baseline for evaluating prompt-based approaches, 64% of PD studies lack non-prompt-related baselines. We provide tables and figures summarizing existing work, and reporting recommendations to guide future research contributions.
CL Oct 13, 2025
FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation
Johann Pignat, Milena Vucetic, Christophe Gaudet-Blavignac et al.
Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology), an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset represents 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.
LG Feb 11, 2022
Evaluation of post-hoc interpretability methods in time-series classification
Hugues Turbé, Mina Bjelogrlic, Christian Lovis et al.
Post-hoc interpretability methods are critical tools to explain neural-network results. Several post-hoc methods have emerged in recent years, but when applied to a given task, they produce different results, raising the question of which method is the most suitable to provide correct post-hoc interpretability. To understand the performance of each method, quantitative evaluation of interpretability methods is essential. However, currently available frameworks have several drawbacks that hinder the adoption of post-hoc interpretability methods, especially in high-risk sectors. In this work, we propose a framework with quantitative metrics to assess the performance of existing post-hoc interpretability methods, in particular for time-series classification. We show that several drawbacks identified in the literature are addressed, namely dependence on human judgement, retraining, and shift in the data distribution when occluding samples. We additionally design a synthetic dataset with known discriminative features and tunable complexity. The proposed methodology and quantitative metrics can be used to understand the reliability of the results of interpretability methods obtained in practical applications. In turn, they can be embedded within operational workflows in critical fields that require accurate interpretability results, e.g., for regulatory policies.