CLOct 25, 2023
Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained Language ModelsPaul Youssef, Osman Alperen Koraş, Meijie Li et al.
Pre-trained Language Models (PLMs) are trained on vast unlabeled data, rich in world knowledge. This fact has sparked the interest of the community in quantifying the amount of factual knowledge present in PLMs, as this explains their performance on downstream tasks, and potentially justifies their use as knowledge bases. In this work, we survey methods and datasets that are used to probe PLMs for factual knowledge. Our contributions are: (1) We propose a categorization scheme for factual probing methods that is based on how their inputs, outputs and the probed PLMs are adapted; (2) We provide an overview of the datasets used for factual probing; (3) We synthesize insights about knowledge retention and prompt optimization in PLMs, analyze obstacles to adopting PLMs as knowledge bases and outline directions for future work.
CVSep 19, 2024
Investigating the Impact of Randomness on Reproducibility in Computer Vision: A Study on Applications in Civil Engineering and MedicineBahadır Eryılmaz, Osman Alperen Koraş, Jörg Schlötterer et al.
Reproducibility is essential for scientific research. However, in computer vision, achieving consistent results is challenging due to various factors. One influential, yet often unrecognized, factor is CUDA-induced randomness. Despite CUDA's advantages for accelerating algorithm execution on GPUs, if not controlled, its behavior across multiple executions remains non-deterministic. While reproducibility issues in ML being researched, the implications of CUDA-induced randomness in application are yet to be understood. Our investigation focuses on this randomness across one standard benchmark dataset and two real-world datasets in an isolated environment. Our results show that CUDA-induced randomness can account for differences up to 4.77% in performance scores. We find that managing this variability for reproducibility may entail increased runtime or reduce performance, but that disadvantages are not as significant as reported in previous studies.
CLApr 5, 2024Code
Does Biomedical Training Lead to Better Medical Performance?Amin Dada, Marie Bauer, Amanda Butler Contreras et al.
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating $25$ models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.
CLFeb 4
Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model MergingSameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koraş et al.
Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5\% (average 7.5\%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers, preserving general-domain capabilities while excelling on specialized tasks.
CLMar 5, 2024
A Second Look on BASS -- Boosting Abstractive Summarization with Unified Semantic Graphs -- A Replication StudyOsman Alperen Koraş, Jörg Schlötterer, Christin Seifert
We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.
CLFeb 24, 2025
Towards Conditioning Clinical Text Generation for User ControlOsman Alperen Koraş, Rabi Bahnan, Jens Kleesiek et al.
Deploying natural language generation systems in clinical settings remains challenging despite advances in Large Language Models (LLMs), which continue to exhibit hallucinations and factual inconsistencies, necessitating human oversight. This paper explores automated dataset augmentation using LLMs as human proxies to condition LLMs for clinician control without increasing cognitive workload. On the BioNLP ACL'24 Discharge Me! Shared Task, we achieve new state-of-the-art results with simpler methods than prior submissions through more efficient training, yielding a 9\% relative improvement without augmented training and up to 34\% with dataset augmentation. Preliminary human evaluation further supports the effectiveness of our approach, highlighting the potential of augmenting clinical text generation for control to enhance relevance, accuracy, and factual consistency.