CL | Aug 29, 2024 | Code
Learning from Negative Samples in Biomedical Generative Entity Linking
Chanhwi Kim, Hyunjae Kim, Sihyeon Park et al.
Generative models have become widely used in biomedical entity linking (BioEL) due to their excellent performance and efficient memory usage. However, these models are usually trained only with positive samples, i.e., entities that match the input mention's identifier, and do not explicitly learn from hard negative samples, which are entities that look similar but have different meanings. To address this limitation, we introduce ANGEL (Learning from Negative Samples in Biomedical Generative Entity Linking), the first framework that trains generative BioEL models using negative samples. Specifically, a generative model is initially trained to generate positive entity names from the knowledge base for given input entities. Subsequently, both correct and incorrect outputs are gathered from the model's top-k predictions. Finally, the model is updated to prioritize the correct predictions through preference optimization. Our models outperform the previous best baseline models by up to 1.4% in average top-1 accuracy across five benchmarks. When incorporating our framework into pre-training, the performance improvement increases further to 1.7%, demonstrating its effectiveness in both the pre-training and fine-tuning stages. The code and model weights are available at https://github.com/dmis-lab/ANGEL.
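The pair-gathering step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_preference_pairs`, the example mention, and the idea of pairing every correct prediction with every incorrect one are assumptions made for the example.

```python
from typing import List, Set, Tuple


def build_preference_pairs(top_k: List[str], gold_names: Set[str]) -> List[Tuple[str, str]]:
    """Pair each correct top-k prediction (chosen) with each incorrect one (rejected).

    Hypothetical helper: the abstract only states that correct and incorrect
    outputs are gathered from the top-k predictions.
    """
    chosen = [name for name in top_k if name in gold_names]
    rejected = [name for name in top_k if name not in gold_names]
    return [(c, r) for c in chosen for r in rejected]


# Made-up top-k output for one mention; gold names share the mention's identifier.
top_k = ["myocardial infarction", "myocardial ischemia", "heart attack"]
gold = {"myocardial infarction", "heart attack"}
pairs = build_preference_pairs(top_k, gold)
print(pairs)
```

Each (chosen, rejected) pair could then be fed to a preference-optimization objective so that the model ranks the correct name above the look-alike negative.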
CL | Oct 14, 2022
Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon et al.
Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where dictionaries do not exist. While a phrase retrieval model was used to construct pseudo-dictionaries with entities retrieved from Wikipedia automatically in a recent study, these dictionaries often have limited coverage because the retriever is likely to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average F1 score of 4.7 across five NER benchmark datasets.
CL | Jan 29
A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine
Anran Li, Yuanyuan Chen, Wenjun Long et al.
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models and with LLaMA-3, DeepSeek-R1, and GPT-4o. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.
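To make the idea of exchanging only adapter parameters concrete, here is a small sketch of server-side aggregation. It assumes plain sample-size-weighted averaging in the style of FedAvg, which is a stand-in and not the adaptive, data-aware rule of Fed-MedLoRA+; the names `Adapter` and `aggregate_adapters` and the toy numbers are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened low-rank adapter weights.
Adapter = Dict[str, List[float]]


def aggregate_adapters(adapters: List[Adapter], num_samples: List[int]) -> Adapter:
    """Average adapter weights across sites, weighted by local sample count.

    Only the small adapter tensors are combined; the frozen base model
    never leaves a site in this sketch.
    """
    total = sum(num_samples)
    merged: Adapter = {}
    for name in adapters[0]:
        size = len(adapters[0][name])
        merged[name] = [
            sum(a[name][i] * n for a, n in zip(adapters, num_samples)) / total
            for i in range(size)
        ]
    return merged


# Two made-up sites with different amounts of local data.
site_a = {"lora_A": [1.0, 2.0]}
site_b = {"lora_A": [3.0, 6.0]}
merged = aggregate_adapters([site_a, site_b], num_samples=[100, 300])
print(merged)
```

Averaging the two low-rank factors separately is itself a simplification, since the mean of the factors is not the mean of their product; it is used here only to keep the example short.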
CL | Feb 3, 2023
LIQUID: A Framework for List Question Answering Dataset Generation
Seongyun Lee, Hyunjae Kim, Jaewoo Kang
Question answering (QA) models often rely on large-scale training datasets, which necessitates the development of a data generation framework to reduce the cost of manual annotations. Although several recent studies have aimed to generate synthetic questions with single-span answers, no study has been conducted on the creation of list questions with multiple, non-contiguous spans as answers. To address this gap, we propose LIQUID, an automated framework for generating list QA datasets from unlabeled corpora. We first convert a passage from Wikipedia or PubMed into a summary and extract named entities from the summarized text as candidate answers. This allows us to select answers that are semantically correlated in context and are, therefore, suitable for constructing list questions. We then create questions using an off-the-shelf question generator with the extracted entities and original passage. Finally, iterative filtering and answer expansion are performed to ensure the accuracy and completeness of the answers. Using our synthetic data, we significantly improve the performance of the previous best list QA models by exact-match F1 scores of 5.0 on MultiSpanQA, 1.9 on Quoref, and 2.8 averaged across three BioASQ benchmarks.
CV | May 1
Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology
Roy Jiang, Hyunjae Kim, Zhenyue Qin et al.
Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
CL | Jul 10, 2023
KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization
Gangwoo Kim, Hajung Kim, Lei Ji et al.
In this paper, we introduce CheXOFA, a new pre-trained vision-language model (VLM) for the chest X-ray domain. Our model is initially pre-trained on various multimodal datasets within the general domain before being transferred to the chest X-ray domain. Following a prominent VLM, we unify various domain-specific tasks into a simple sequence-to-sequence schema. It enables the model to effectively learn the required knowledge and skills from limited resources in the domain. Demonstrating superior performance on the benchmark datasets provided by the BioNLP shared task, our model benefits from its training across multiple tasks and domains. With subtle techniques including ensemble and factual calibration, our system achieves first place on the RadSum23 leaderboard for the hidden test set.
CL | Mar 30, 2024 | Code
Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks
Hyunjae Kim, Hyeon Hwang, Jiwoo Lee et al.
While recent advancements in commercial large language models (LMs) have shown promising results in medical tasks, their closed-source nature poses significant privacy and security concerns, hindering their widespread use in the medical field. Despite efforts to create open-source models, their limited parameters often result in insufficient multi-step reasoning capabilities required for solving complex medical problems. To address this, we introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters. The models were trained using our new synthetic dataset consisting of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. Our systems achieved remarkable accuracy across six medical benchmarks, surpassing previous best models such as MediTron and BioMistral, as well as GPT-3.5, by a large margin. Notably, Meerkat-7B surpassed the passing threshold of the United States Medical Licensing Examination (USMLE) for the first time for a 7B-parameter model, while Meerkat-70B outperformed GPT-4 by an average of 1.3%. Additionally, Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming humans' 13.8 and closely matching GPT-4's 21.8. Our systems offered more detailed free-form responses to clinical queries compared to existing small models, approaching the performance level of large commercial models. This significantly narrows the performance gap with large LMs, showcasing their effectiveness in addressing complex medical challenges.
CL | Nov 1, 2024 | Code
Rationale-Guided Retrieval Augmented Generation for Medical Question Answering
Jiwoong Sohn, Yein Park, Chanwoong Yoon et al.
Large language models (LLMs) hold significant potential for applications in biomedicine, but they struggle with hallucinations and outdated knowledge. While retrieval-augmented generation (RAG) is generally employed to address these issues, it also has its own set of challenges: (1) LLMs are vulnerable to irrelevant or incorrect context, (2) medical queries are often not well-targeted for helpful information, and (3) retrievers are prone to bias toward the specific source corpus they were trained on. In this study, we present RAG² (RAtionale-Guided RAG), a new framework for enhancing the reliability of RAG in biomedical contexts. RAG² incorporates three key innovations: (1) a small filtering model trained on perplexity-based labels of rationales, which selectively augments informative snippets of documents while filtering out distractors; (2) LLM-generated rationales as queries to improve the utility of retrieved snippets; and (3) a structure designed to retrieve snippets evenly from a comprehensive set of four biomedical corpora, effectively mitigating retriever bias. Our experiments demonstrate that RAG² improves the state-of-the-art LLMs of varying sizes, with improvements of up to 6.1%, and it outperforms the previous best medical RAG model by up to 5.6% across three medical question-answering benchmarks. Our code is available at https://github.com/dmis-lab/RAG2.
CL | Oct 22, 2024 | Code
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage
Taewhoo Lee, Chanwoong Yoon, Kyochul Jang et al.
Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 1,986 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
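Read literally, information coverage is a ratio of necessary context to total context. The sketch below assumes both quantities are measured in tokens, which the abstract does not specify; the function name and the example sizes are illustrative only.

```python
def information_coverage(necessary_tokens: int, total_tokens: int) -> float:
    """Fraction of the input context that is needed to answer the query.

    Assumption: context is counted in tokens; any consistent unit
    (sentences, chunks) would work the same way.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0 <= necessary_tokens <= total_tokens:
        raise ValueError("necessary_tokens must lie between 0 and total_tokens")
    return necessary_tokens / total_tokens


# A needle-in-a-haystack style item: a short needed passage in a long context.
low_ic = information_coverage(necessary_tokens=50, total_tokens=1000)
# A task that needs the whole document, such as summarizing a full book.
high_ic = information_coverage(necessary_tokens=1000, total_tokens=1000)
print(low_ic, high_ic)
```

Under this reading, a benchmark can have very long inputs and still score low on IC when only a small slice of each input matters.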
CL | Jun 13, 2025 | Code
Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards
Jaehoon Yun, Jiwoong Sohn, Jungwoo Park et al.
Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: https://med-prm.github.io/
CL | Nov 10, 2025
Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights
Hyunjae Kim, Jiwoong Sohn, Aidan Gilson et al.
Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
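The precision and recall figures quoted for evidence selection can be illustrated with a small set-based computation. This is a generic sketch and not the study's annotation protocol: it assumes passages are identified by IDs and that expert judgments yield a set of relevant passages; `precision_recall` and the passage IDs are made up.

```python
from typing import Set, Tuple


def precision_recall(selected: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the passages a model actually used.

    selected: passage IDs the model drew on in its answer (assumed given).
    relevant: passage IDs judged relevant by experts (assumed given).
    """
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: the model used four passages, two of which were relevant,
# and it missed one relevant passage.
p, r = precision_recall({"p1", "p2", "p3", "p4"}, {"p2", "p4", "p5"})
print(p, r)
```

Low precision means the response leans on irrelevant passages; low recall means relevant evidence was retrieved but left unused.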
CL | Jan 13
Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning
Fan Gao, Sherry T. Tong, Jiwoong Sohn et al.
While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.
CV | Jan 25 | Code
Benchmarking Direct Preference Optimization for Medical Large Vision-Language Models
Dain Kim, Jiwoo Lee, Jaehoon Yun et al.
Large Vision-Language Models (LVLMs) hold significant promise for medical applications, yet their deployment is often constrained by insufficient alignment and reliability. While Direct Preference Optimization (DPO) has emerged as a potent framework for refining model responses, its efficacy in high-stakes medical contexts remains underexplored, lacking the rigorous empirical groundwork necessary to guide future methodological advances. To bridge this gap, we present the first comprehensive examination of diverse DPO variants within the medical domain, evaluating nine distinct formulations across two medical LVLMs: LLaVA-Med and HuatuoGPT-Vision. Our results reveal several critical limitations: current DPO approaches often yield inconsistent gains over supervised fine-tuning, with their efficacy varying significantly across different tasks and backbones. Furthermore, they frequently fail to resolve fundamental visual misinterpretation errors. Building on these insights, we present a targeted preference construction strategy as a proof-of-concept that explicitly addresses visual misinterpretation errors frequently observed in existing DPO models. This design yields a 3.6% improvement over the strongest existing DPO baseline on visual question-answering tasks. To support future research, we release our complete framework, including all training data, model checkpoints, and our codebase at https://github.com/dmis-lab/med-vlm-dpo.
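For readers unfamiliar with DPO, the standard per-example objective can be written in a few lines. This sketch shows the original DPO loss only, not any of the nine variants benchmarked in the paper; the argument names, the default `beta`, and the log-probability values are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    """
    # How much more the policy favors the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# With no change relative to the reference, the loss equals log(2).
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
# When the policy shifts probability toward the chosen response, the loss drops.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
print(neutral, improved)
```

The loss depends only on relative log-probabilities, which may help explain why it does not by itself correct a response pair that shares the same visual misreading.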
CV | Feb 23, 2024
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim, Seunghyun Yoon, Trung Bui et al.
Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.
CL | Jan 20, 2025
Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study
Sahana Srinivasan, Xuguang Ai, Minjie Zou et al.
Question: What is the performance and reasoning ability of OpenAI o1 compared to other large language models in addressing ophthalmology-specific questions? Findings: This study evaluated OpenAI o1 and five LLMs using 6,990 ophthalmological questions from MedMCQA. O1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capabilities based on text-generation metrics. Across subtopics, o1 ranked first in "Lens" and "Glaucoma" but second to GPT-4o in "Corneal and External Diseases", "Vitreous and Retina" and "Oculoplastic and Orbital Diseases". Subgroup analyses showed o1 performed better on queries with longer ground truth explanations. Meaning: O1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields like ophthalmology.
CL | Apr 15, 2025
Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items
Minjie Zou, Sahana Srinivasan, Thaddaeus Wai Soon Lo et al.
Recent advances in reasoning-focused large language models (LLMs) mark a shift from general LLMs toward models designed for complex decision-making, a crucial aspect in medicine. However, their performance in specialized domains like ophthalmology remains underexplored. This study comprehensively evaluated and compared the accuracy and reasoning capabilities of four newly developed reasoning-focused LLMs, namely DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking. Each model was assessed using 5,888 multiple-choice ophthalmology exam questions from the MedMCQA dataset in a zero-shot setting. Quantitative evaluation included accuracy, Macro-F1, and five text-generation metrics (ROUGE-L, METEOR, BERTScore, BARTScore, and AlignScore), computed against ground-truth reasonings. Average inference time was recorded for a subset of 100 randomly selected questions. Additionally, two board-certified ophthalmologists qualitatively assessed clarity, completeness, and reasoning structure of responses to differential diagnosis questions. O1 (0.902) and DeepSeek-R1 (0.888) achieved the highest accuracy, with o1 also leading in Macro-F1 (0.900). The performance of models across the text-generation metrics varied: o3-mini excelled in ROUGE-L (0.151), o1 in METEOR (0.232), DeepSeek-R1 and o3-mini tied for BERTScore (0.673), DeepSeek-R1 (-4.105) and Gemini 2.0 Flash-Thinking (-4.127) performed best in BARTScore, while o3-mini (0.181) and o1 (0.176) led AlignScore. Inference time across the models varied, with DeepSeek-R1 being slowest (40.4 seconds) and Gemini 2.0 Flash-Thinking fastest (6.7 seconds). Qualitative evaluation revealed that DeepSeek-R1 and Gemini 2.0 Flash-Thinking tended to provide detailed and comprehensive intermediate reasoning, whereas o1 and o3-mini displayed concise and summarized justifications.
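Accuracy and Macro-F1 on multiple-choice answers can be computed as in the sketch below. It is a generic illustration rather than the study's evaluation script; the option letters and predictions are made up, and the label set is assumed to be the union of gold and predicted options.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Share of questions where the predicted option matches the gold option."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-option F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up answers for four questions.
gold = ["A", "B", "A", "C"]
pred = ["A", "B", "C", "C"]
acc = accuracy(gold, pred)
mf1 = macro_f1(gold, pred)
print(acc, mf1)
```

Macro-F1 weights every answer option equally, so it can diverge from accuracy when a model favors some options over others.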
AI | May 1, 2024
CookingSense: A Culinary Knowledgebase with Multidisciplinary Assertions
Donghee Choi, Mogan Gim, Donghyeon Park et al.
This paper introduces CookingSense, a descriptive collection of knowledge assertions in the culinary domain extracted from various sources, including web data, scientific papers, and recipes, from which knowledge covering a broad range of aspects is acquired. CookingSense is constructed through a series of dictionary-based filtering and language model-based semantic filtering techniques, which results in a rich knowledgebase of multidisciplinary food-related assertions. Additionally, we present FoodBench, a novel benchmark to evaluate culinary decision support systems. From evaluations with FoodBench, we empirically prove that CookingSense improves the performance of retrieval augmented language models. We also validate the quality and variety of assertions in CookingSense through qualitative analysis.
CV | Nov 27, 2025
From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation
Zhen Chen, Yihang Fu, Gabriel Madera et al.
Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.
CL | Jul 21, 2025
BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning
Sahana Srinivasan, Xuguang Ai, Thaddaeus Wai Soon Lo et al.
Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice-questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ's correct answer. This was further adjudicated by three senior ophthalmologists. To illustrate BELO's utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.
CL | Jun 15, 2024
Augmenting Biomedical Named Entity Recognition with General-domain Resources
Yu Yin, Hyunjae Kim, Xiao Xiao et al.
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets.
CL | Dec 16, 2021
Simple Questions Generate Named Entity Recognition Datasets
Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon et al.
Recent named entity recognition (NER) models often rely on human-annotated datasets, requiring the significant engagement of professional knowledge on the target domain and entities. This research introduces an ask-to-generate approach that automatically generates NER datasets by asking questions in simple natural language to an open-domain question answering system (e.g., "Which disease?"). Despite using fewer in-domain resources, our models, solely trained on the generated datasets, largely outperform strong low-resource models by an average F1 score of 19.4 for six popular NER benchmarks. Furthermore, our models provide competitive performance with rich-resource models that additionally leverage in-domain dictionaries provided by domain experts. In few-shot NER, we outperform the previous best model by an F1 score of 5.2 on three benchmarks and achieve new state-of-the-art performance.
CLNov 20, 2021
Improving Tagging Consistency and Entity Coverage for Chemical Identification in Full-text ArticlesHyunjae Kim, Mujeen Sung, Wonjin Yoon et al.
This paper is a technical report on our system submitted to the chemical identification task of the BioCreative VII Track 2 challenge. The main feature of this challenge is that the data consists of full-text articles, while current datasets usually consist of only titles and abstracts. To effectively address the problem, we aim to improve tagging consistency and entity coverage using various methods such as majority voting within the same articles for named entity recognition (NER) and a hybrid approach that combines a dictionary and a neural model for normalization. In the experiments on the NLM-Chem dataset, we show that our methods improve models' performance, particularly in terms of recall. Finally, in the official evaluation of the challenge, our system was ranked 1st in NER by significantly outperforming the baseline model and more than 80 submissions from 16 teams.
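As a rough illustration of the majority-voting idea mentioned in this abstract, the sketch below relabels every occurrence of a mention string within one article with the label it received most often. The abstract does not specify the exact voting scheme, so the function name `majority_vote`, the label strings, and the case-insensitive matching are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(mentions):
    """Harmonize labels for repeated mentions inside a single article.

    `mentions` is a list of (surface_text, predicted_label) pairs.
    Matching on the lowercased surface text is an assumption of this sketch.
    """
    votes = defaultdict(Counter)
    for text, label in mentions:
        votes[text.lower()][label] += 1

    # Pick the most frequent label for each distinct surface form.
    resolved = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

    # Re-emit the mentions in their original order with the voted label.
    return [(text, resolved[text.lower()]) for text, _ in mentions]
```

For example, if "aspirin" is tagged as a chemical twice and missed once in the same article, the missed occurrence is relabeled as a chemical, which is one way such voting could raise recall.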
CLJun 22, 2021
Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question AnsweringGangwoo Kim, Hyunjae Kim, Jungsoo Park et al.
One of the main challenges in conversational question answering (CQA) is to resolve the conversational dependency, such as anaphora and ellipsis. However, existing approaches do not explicitly train QA models on how to resolve the dependency, and thus these models are limited in understanding human dialogues. In this paper, we propose a novel framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency) to enhance the abilities of QA models in comprehending conversational context. ExCorD first generates self-contained questions that can be understood without the conversation history, then trains a QA model with the pairs of original and self-contained questions using a consistency-based regularizer. In our experiments, we demonstrate that ExCorD significantly improves the QA models' performance by up to 1.2 F1 on QuAC, and 5.2 F1 on CANARD, while addressing the limitations of the existing approaches.
CLJan 15, 2021
"Killing Me" Is Not a Spoiler: Spoiler Detection Model using Graph Neural Networks with Dependency Relation-Aware Attention MechanismBuru Chang, Inggeol Lee, Hyunjae Kim et al.
Several machine learning-based spoiler detection models have been proposed recently to protect users from spoilers on review websites. Although dependency relations between context words are important for detecting spoilers, current attention-based spoiler detection models are insufficient for utilizing dependency relations. To address this problem, we propose a new spoiler detection model called SDGNN that is based on syntax-aware graph neural networks. In the experiments on two real-world benchmark datasets, we show that our SDGNN outperforms the existing spoiler detection models.
CLJan 1, 2021
How Do Your Biomedical Named Entity Recognition Models Generalize to Novel Entities?Hyunjae Kim, Jaewoo Kang
The volume of biomedical literature on new biomedical concepts is rapidly increasing, which necessitates a reliable biomedical named entity recognition (BioNER) model for identifying new and unseen entity mentions. However, it is questionable whether existing models can effectively handle them. In this work, we systematically analyze the three types of recognition abilities of BioNER models: memorization, synonym generalization, and concept generalization. We find that although current best models achieve state-of-the-art performance on benchmarks based on overall performance, they have limitations in identifying synonyms and new biomedical concepts, indicating that their generalization abilities are overestimated. We also investigate failure cases of models and identify several difficulties in recognizing unseen mentions in biomedical literature as follows: (1) models tend to exploit dataset biases, which hinders the models' abilities to generalize, and (2) several biomedical names have novel morphological patterns with weak name regularity, and models fail to recognize them. We apply a statistics-based debiasing method to our problem as a simple remedy and show the improvement in generalization to unseen mentions. We hope that our analyses and findings will facilitate further research into the generalization capabilities of NER models in a domain where their reliability is of utmost importance.
CLApr 30, 2020
Look at the First Sentence: Position Bias in Question AnsweringMiyoung Ko, Jinhyuk Lee, Hyunjae Kim et al.
Many extractive question answering models are trained to predict start and end positions of answers. The choice of predicting answers as positions is mainly due to its simplicity and effectiveness. In this study, we hypothesize that when the distribution of the answer positions is highly skewed in the training set (e.g., answers lie only in the k-th sentence of each passage), QA models predicting answers as positions can learn spurious positional cues and fail to give answers in different positions. We first illustrate this position bias in popular extractive QA models such as BiDAF and BERT and thoroughly examine how position bias propagates through each layer of BERT. To safely deliver position information without position bias, we train models with various de-biasing methods including entropy regularization and bias ensembling. Among them, we found that using the prior distribution of answer positions as a bias model is very effective at reducing position bias, recovering the performance of BERT from 37.48% to 81.64% when trained on a biased SQuAD dataset.
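The idea of using the prior distribution of answer positions as a bias model can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: it uses a product-of-experts style combination (adding log-probabilities and renormalizing), which is one common form of bias ensembling, and the function names, the add-one smoothing, and the use of a single position index per answer are all hypothetical choices.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def position_prior(answer_positions, num_positions, smoothing=1.0):
    """Estimate the prior over answer positions from training-set counts.

    Additive smoothing (an assumption here) keeps every probability non-zero.
    """
    counts = [smoothing] * num_positions
    for p in answer_positions:
        counts[p] += 1
    total = sum(counts)
    return [c / total for c in counts]


def ensembled_log_probs(model_logits, prior):
    """Combine the model with the fixed bias model in log space.

    During training, the loss would be computed on this ensemble so the
    model is not rewarded for re-learning the positional prior; at test
    time, the model's own logits would be used alone.
    """
    model_log_probs = log_softmax(model_logits)
    combined = [lp + math.log(q) for lp, q in zip(model_log_probs, prior)]
    return log_softmax(combined)
```

With uninformative model logits, the ensemble simply reproduces the prior, which shows that the positional signal is carried by the bias model rather than by the QA model.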
AIApr 11, 2020
Exploring The Spatial Reasoning Ability of Neural Models in Human IQ TestsHyunjae Kim, Yookyung Koh, Jinheon Baek et al.
Although neural models have performed impressively well on various tasks such as image recognition and question answering, their reasoning ability has been measured in only a few studies. In this work, we focus on spatial reasoning and explore the spatial understanding of neural models. First, we describe the following two spatial reasoning IQ tests: rotation and shape composition. Using well-defined rules, we constructed datasets that consist of various complexity levels. We designed a variety of experiments in terms of generalization, and evaluated six different baseline models on the newly generated datasets. We provide an analysis of the results and factors that affect the generalization abilities of models. Also, we analyze how neural models solve spatial reasoning tests with visual aids. Our findings provide valuable insights into understanding machines and the differences between machines and humans.
ASApr 9, 2020
Fast frequency discrimination and phoneme recognition using a biomimetic membrane coupled to a neural networkWoo Seok Lee, Hyunjae Kim, Andrew N. Cleland et al.
In the human ear, the basilar membrane plays a central role in sound recognition. When excited by sound, this membrane responds with a frequency-dependent displacement pattern that is detected and identified by the auditory hair cells combined with the human neural system. Inspired by this structure, we designed and fabricated an artificial membrane that produces a spatial displacement pattern in response to an audible signal, which we used to train a convolutional neural network (CNN). When trained with single frequency tones, this system can unambiguously distinguish tones closely spaced in frequency. When instead trained to recognize spoken vowels, this system outperforms existing methods for phoneme recognition, including the discrete Fourier transform (DFT), zoom FFT and chirp z-transform, especially when tested in short time windows. This sound recognition scheme therefore promises significant benefits in fast and accurate sound identification compared to existing methods.
LGMar 25, 2019
Predicting Multiple Demographic Attributes with Task Specific Embedding Transformation and Attention NetworkRaehyun Kim, Hyunjae Kim, Janghyuk Lee et al.
Most companies utilize demographic information to develop their strategy in a market. However, such information is not available to most retail companies. Several studies have been conducted to predict the demographic attributes of users from their transaction histories, but they have some limitations. First, they focused on parameter sharing to predict all attributes, but capturing task-specific features is also important in multi-task learning. Second, they assumed that all transactions are equally important in predicting demographic attributes. However, some transactions are more useful than others for predicting a certain attribute. Furthermore, the decision-making process of these models cannot be interpreted, as they work in a black-box manner. To address these limitations, we propose an Embedding Transformation Network with Attention (ETNA) model, which shares representations at the bottom of the model structure and transforms them to task-specific representations using a simple linear transformation method. In addition, we can obtain more informative transactions for predicting certain attributes using the attention mechanism. The experimental results show that our model outperforms the previous models on all tasks. In our qualitative analysis, we show the visualization of attention weights, which provides business managers with some useful insights.
CLOct 1, 2018
Ranking Paragraphs for Improving Answer Recall in Open-Domain Question AnsweringJinhyuk Lee, Seongjun Yun, Hyunjae Kim et al.
Recently, open-domain question answering (QA) has been combined with machine comprehension models to find answers in a large knowledge source. As open-domain QA requires retrieving relevant documents from text corpora to answer questions, its performance largely depends on the performance of document retrievers. However, since traditional information retrieval systems are not effective in obtaining documents with a high probability of containing answers, they lower the performance of QA systems. Simply extracting more documents increases the number of irrelevant documents, which also degrades the performance of QA systems. In this paper, we introduce Paragraph Ranker, which ranks paragraphs of retrieved documents for a higher answer recall with less noise. We show that ranking paragraphs and aggregating answers using Paragraph Ranker improves the performance of an open-domain QA pipeline on four open-domain QA datasets by 7.8% on average.
CVJul 2, 2018
Liver Lesion Detection from Weakly-labeled Multi-phase CT Volumes with a Grouped Single Shot MultiBox DetectorSang-gil Lee, Jae Seok Bae, Hyunjae Kim et al.
We present a focal liver lesion detection model leveraged by custom-designed multi-phase computed tomography (CT) volumes, which reflects real-world clinical lesion detection practice using a Single Shot MultiBox Detector (SSD). We show that grouped convolutions effectively harness richer information of the multi-phase data for the object detection model, while a naive application of SSD suffers from a generalization gap. We trained and evaluated the modified SSD model and recently proposed variants with our CT dataset of 64 subjects by five-fold cross-validation. Our model achieved a 53.3% average precision score and ran in under three seconds per volume, outperforming the original model and state-of-the-art variants. Results show that the one-stage object detection model is a practical solution, which runs in near real-time and can learn an unbiased feature representation from a large-volume real-world detection dataset, which requires less tedious and time-consuming construction of the weak phase-level bounding box labels.
NENov 15, 2016
Intrinsic Geometric Information Transfer Learning on Multiple Graph-Structured DatasetsJaekoo Lee, Hyunjae Kim, Jongsun Lee et al.
Graphs provide a powerful means for representing complex interactions between entities. Recently, deep learning approaches are emerging for representing and modeling graph-structured data, although the conventional deep learning methods (such as convolutional neural networks and recurrent neural networks) have mainly focused on grid-structured inputs (image and audio). Leveraged by the capability of representation learning, deep learning based techniques are reporting promising results for graph applications by detecting structural characteristics of graphs in an automated fashion. In this paper, we attempt to advance deep learning for graph-structured data by incorporating another component, transfer learning. By transferring the intrinsic geometric information learned in the source domain, our approach can help us to construct a model for a new but related task in the target domain without collecting new data and without training a new model from scratch. We thoroughly test our approach with large-scale real corpora and confirm the effectiveness of the proposed transfer learning framework for deep learning on graphs. According to our experiments, transfer learning is most effective when the source and target domains bear a high level of structural similarity in their graph representations.