Edward Choi

LG
h-index54
106papers
11,463citations
Novelty45%
AI Score62

106 Papers

95.3LGMay 28Code
K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

Eunbyeol Cho, Yunseung Lee, Mirae Kim et al.

Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.

CVAug 3, 2023Code
VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception

Jiyoung Lee, Seungho Kim, Seunghyun Won et al. · nvidia, utoronto

AI alignment refers to models acting towards human-intended goals, preferences, or ethical principles. Given that most large-scale deep learning models act as black boxes and cannot be manually controlled, analyzing the similarity between models and humans can be a proxy measure for ensuring AI safety. In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment. Specifically, we propose a new dataset for measuring AI-human visual alignment in terms of image classification, a fundamental task in machine perception. In order to evaluate AI-human visual alignment, a dataset should encompass samples with various scenarios that may arise in the real world and have gold human perception labels. Our dataset consists of three groups of samples, namely Must-Act (i.e., Must-Classify), Must-Abstain, and Uncertain, based on the quantity and clarity of visual information in an image and further divided into eight categories. All samples have a gold human perception label; even Uncertain (severely blurry) sample labels were obtained via crowd-sourcing. The validity of our dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using our dataset, we analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods. Our code and data is available at https://github.com/jiyounglee-0523/VisAlign.

CLJan 16, 2023Code
EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records

Gyubok Lee, Hyeonji Hwang, Seongsu Bae et al.

We present a new text-to-SQL dataset for electronic health records (EHRs). The utterances were collected from 222 hospital staff members, including physicians, nurses, and insurance review and health records teams. To construct the QA dataset on structured EHR data, we conducted a poll at a university hospital and used the responses to create seed questions. We then manually linked these questions to two open-source EHR databases, MIMIC-III and eICU, and included various time expressions and held-out unanswerable questions in the dataset, which were also collected from the poll. Our dataset poses a unique set of challenges: the model needs to 1) generate SQL queries that reflect a wide range of needs in the hospital, including simple retrieval and complex operations such as calculating survival rate, 2) understand various time expressions to answer time-sensitive questions in healthcare, and 3) distinguish whether a given question is answerable or unanswerable. We believe our dataset, EHRSQL, can serve as a practical benchmark for developing and assessing QA models on structured EHR data and take a step further towards bridging the gap between text-to-SQL research and its real-life deployment in healthcare. EHRSQL is available at https://github.com/glee4810/EHRSQL.

CLMar 9, 2023Code
Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports

Hyunseung Chung, Jiho Kim, Joon-myoung Kwon et al.

Electrocardiogram (ECG) synthesis is the area of research focused on generating realistic synthetic ECG signals for medical use without concerns over annotation costs or clinical data privacy restrictions. Traditional ECG generation models consider a single ECG lead and utilize GAN-based generative models. These models can only generate single lead samples and require separate training for each diagnosis class. The diagnosis classes of ECGs are insufficient to capture the intricate differences between ECGs depending on various features (e.g. patient demographic details, co-existing diagnosis classes, etc.). To alleviate these challenges, we present a text-to-ECG task, in which textual inputs are used to produce ECG outputs. Then we propose Auto-TTE, an autoregressive generative model conditioned on clinical text reports to synthesize 12-lead ECGs, for the first time to our knowledge. We compare the performance of our model with other representative models in text-to-speech and text-to-image. Experimental results show the superiority of our model in various quantitative evaluations and qualitative analysis. Finally, we conduct a user study with three board-certified cardiologists to confirm the fidelity and semantic alignment of generated samples. our code will be available at https://github.com/TClife/text_to_ecg

QMJun 21, 2023Code
ECG-QA: A Comprehensive Question Answering Dataset Combined With Electrocardiogram

Jungwoo Oh, Gyubok Lee, Seongsu Bae et al.

Question answering (QA) in the field of healthcare has received much attention due to significant advancements in natural language processing. However, existing healthcare QA datasets primarily focus on medical images, clinical notes, or structured electronic health record tables. This leaves the vast potential of combining electrocardiogram (ECG) data with these systems largely untapped. To address this gap, we present ECG-QA, the first QA dataset specifically designed for ECG analysis. The dataset comprises a total of 70 question templates that cover a wide range of clinically relevant ECG topics, each validated by an ECG expert to ensure their clinical utility. As a result, our dataset includes diverse ECG interpretation questions, including those that require a comparative analysis of two different ECGs. In addition, we have conducted numerous experiments to provide valuable insights for future research directions. We believe that ECG-QA will serve as a valuable resource for the development of intelligent QA systems capable of assisting clinicians in ECG interpretations. Dataset URL: https://github.com/Jwoo5/ecg-qa

CLMar 18, 2022Code
Graph-Text Multi-Modal Pre-training for Medical Representation Learning

Sungjin Park, Seongsu Bae, Jiho Kim et al.

As the volume of Electronic Health Records (EHR) sharply grows, there has been emerging interest in learning the representation of EHR for healthcare applications. Representation learning of EHR requires appropriate modeling of the two dominant modalities in EHR: structured data and unstructured text. In this paper, we present MedGTX, a pre-trained model for multi-modal representation learning of the structured and textual EHR data. MedGTX uses a novel graph encoder to exploit the graphical nature of structured EHR data, and a text encoder to handle unstructured text, and a cross-modal encoder to learn a joint representation space. We pre-train our model through four proxy tasks on MIMIC-III, an open-source EHR data, and evaluate our model on two clinical benchmarks and three novel downstream tasks which tackle real-world problems in EHR data. The results consistently show the effectiveness of pre-training the model for joint representation of both structured and unstructured information from EHR. Given the promising performance of MedGTX, we believe this work opens a new door to jointly understanding the two fundamental modalities of EHR data.

CVApr 15, 2022Code
Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Hyungyung Lee, Sungjin Park, Joonseok Lee et al.

Although deep generative models have gained a lot of attention, most of the existing works are designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, with which we discover that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence, then we perform unconditional image-text pair generation with the code sequence. Extensive experiments show the correlation between the quantized joint space and the multimodal generation capability on synthetic and real-world datasets. In addition, we demonstrate the superiority of our approach in these two aspects over several baselines. The source code is publicly available at: https://github.com/ttumyche/MXQ-VAE.

LGDec 14, 2022
Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders

Jongseong Jang, Daeun Kyung, Seung Hwan Kim et al.

Deep neural networks are increasingly used in medical imaging for tasks such as pathological classification, but they face challenges due to the scarcity of high-quality, expert-labeled training data. Recent efforts have utilized pre-trained contrastive image-text models like CLIP, adapting them for medical use by fine-tuning the model with chest X-ray images and corresponding reports for zero-shot pathology classification, thus eliminating the need for pathology-specific annotations. However, most studies continue to use the same contrastive learning objectives as in the general domain, overlooking the multi-labeled nature of medical image-report pairs. In this paper, we propose a new fine-tuning strategy that includes positive-pair loss relaxation and random sentence sampling. We aim to improve the performance of zero-shot pathology classification without relying on external knowledge. Our method can be applied to any pre-trained contrastive image-text encoder and easily transferred to out-of-domain datasets without further training, as it does not use external data. Our approach consistently improves overall zero-shot pathology classification across four chest X-ray datasets and three pre-trained models, with an average macro AUROC increase of 4.3%. Additionally, our method outperforms the state-of-the-art and marginally surpasses board-certified radiologists in zero-shot classification for the five competition pathologies in the CheXpert dataset.

CVOct 17, 2022Code
Correlation between Alignment-Uniformity and Performance of Dense Contrastive Representations

Jong Hak Moon, Wonjae Kim, Edward Choi

Recently, dense contrastive learning has shown superior performance on dense prediction tasks compared to instance-level contrastive learning. Despite its supremacy, the properties of dense contrastive representations have not yet been carefully studied. Therefore, we analyze the theoretical ideas of dense contrastive learning using a standard CNN and straightforward feature matching scheme rather than propose a new complex method. Inspired by the analysis of the properties of instance-level contrastive representations through the lens of alignment and uniformity on the hypersphere, we employ and extend the same lens for the dense contrastive representations to analyze their underexplored properties. We discover the core principle in constructing a positive pair of dense features and empirically proved its validity. Also, we introduces a new scalar metric that summarizes the correlation between alignment-and-uniformity and downstream performance. Using this metric, we study various facets of densely learned contrastive representations such as how the correlation changes over single- and multi-object datasets or linear evaluation and dense prediction tasks. The source code is publicly available at: https://github.com/SuperSupermoon/DenseCL-analysis

CLOct 28, 2023Code
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

Seongsu Bae, Daeun Kyung, Jaehee Ryu et al.

Electronic Health Records (EHRs), which contain patients' medical histories in various multi-modal formats, often overlook the potential for joint reasoning across imaging and table modalities underexplored in current EHR Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel multi-modal question answering dataset combining structured EHRs and chest X-ray images. To develop our dataset, we first construct two uni-modal resources: 1) The MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we successfully construct a multi-modal EHR QA dataset that necessitates both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions within EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. This pioneering endeavor enhances engagement with multi-modal EHR sources and we believe that our dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research. EHRXQA is available at https://github.com/baeseongsu/ehrxqa.

CLAug 29, 2022Code
Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity

Taehee Kim, ChaeHun Park, Jimin Hong et al.

Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies explored the idea of utilizing synthetically generated data from pretrained language models (PLMs) as a training corpus. However, PLMs often generate sentences much different from the ones written by human. We hypothesize that treating all these synthetic examples equally for training deep neural networks can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences. Based on this, we propose a novel approach that first trains the classifier to measure the importance of each sentence. The distilled information from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four real-world datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the existing baselines. Our implementation is publicly available at https://github.com/ddehun/coling2022_reweighting_sts.

IVMar 9, 2023
Perspective Projection-Based 3D CT Reconstruction from Biplanar X-rays

Daeun Kyung, Kyungmin Jo, Jaegul Choo et al.

X-ray computed tomography (CT) is one of the most common imaging techniques used to diagnose various diseases in the medical field. Its high contrast sensitivity and spatial resolution allow the physician to observe details of body parts such as bones, soft tissue, blood vessels, etc. As it involves potentially harmful radiation exposure to patients and surgeons, however, reconstructing 3D CT volume from perpendicular 2D X-ray images is considered a promising alternative, thanks to its lower radiation risk and better accessibility. This is highly challenging though, since it requires reconstruction of 3D anatomical information from 2D images with limited views, where all the information is overlapped. In this paper, we propose PerX2CT, a novel CT reconstruction framework from X-ray that reflects the perspective projection scheme. Our proposed method provides a different combination of features for each coordinate which implicitly allows the model to obtain information about the 3D location. We reveal the potential to reconstruct the selected part of CT with high resolution by properly using the coordinate-wise local and global features. Our approach shows potential for use in clinical applications with low computational complexity and fast inference time, demonstrating superior performance than baselines in multiple evaluation metrics.

CVMay 29, 2022
Revisiting the Importance of Amplifying Bias for Debiasing

Jungsoo Lee, Jeonghoon Park, Daeyoung Kim et al.

In image classification, "debiasing" aims to train a classifier to be less susceptible to dataset bias, the strong correlation between peripheral attributes of data samples and a target class. For example, even if the frog class in the dataset mainly consists of frog images with a swamp background (i.e., bias-aligned samples), a debiased classifier should be able to correctly classify a frog at a beach (i.e., bias-conflicting samples). Recent debiasing approaches commonly use two components for debiasing, a biased model $f_B$ and a debiased model $f_D$. $f_B$ is trained to focus on bias-aligned samples (i.e., overfitted to the bias) while $f_D$ is mainly trained with bias-conflicting samples by concentrating on samples which $f_B$ fails to learn, leading $f_D$ to be less susceptible to the dataset bias. While the state-of-the-art debiasing techniques have aimed to better train $f_D$, we focus on training $f_B$, an overlooked component until now. Our empirical analysis reveals that removing the bias-conflicting samples from the training set for $f_B$ is important for improving the debiasing performance of $f_D$. This is due to the fact that the bias-conflicting samples work as noisy samples for amplifying the bias for $f_B$ since those samples do not include the bias attribute. To this end, we propose a simple yet effective data sample selection method which removes the bias-conflicting samples to construct a bias-amplified dataset for training $f_B$. Our data sample selection method can be directly applied to existing reweighting-based debiasing approaches, obtaining consistent performance boost and achieving the state-of-the-art performance on both synthetic and real-world datasets.

LGMar 14, 2022
Lead-agnostic Self-supervised Learning for Local and Global Representations of Electrocardiogram

Jungwoo Oh, Hyunseung Chung, Joon-myoung Kwon et al.

In recent years, self-supervised learning methods have shown significant improvement for pre-training with unlabeled data and have proven helpful for electrocardiogram signals. However, most previous pre-training methods for electrocardiogram focused on capturing only global contextual representations. This inhibits the models from learning fruitful representation of electrocardiogram, which results in poor performance on downstream tasks. Additionally, they cannot fine-tune the model with an arbitrary set of electrocardiogram leads unless the models were pre-trained on the same set of leads. In this work, we propose an ECG pre-training method that learns both local and global contextual representations for better generalizability and performance on downstream tasks. In addition, we propose random lead masking as an ECG-specific augmentation method to make our proposed model robust to an arbitrary set of leads. Experimental results on two downstream tasks, cardiac arrhythmia classification and patient identification, show that our proposed approach outperforms other state-of-the-art methods.

LGJul 26, 2022
Task Agnostic and Post-hoc Unseen Distribution Detection

Radhika Dua, Seongjun Yang, Yixuan Li et al.

Despite the recent advances in out-of-distribution(OOD) detection, anomaly detection, and uncertainty estimation tasks, there do not exist a task-agnostic and post-hoc approach. To address this limitation, we design a novel clustering-based ensembling method, called Task Agnostic and Post-hoc Unseen Distribution Detection (TAPUDD) that utilizes the features extracted from the model trained on a specific task. Explicitly, it comprises of TAP-Mahalanobis, which clusters the training datasets' features and determines the minimum Mahalanobis distance of the test sample from all clusters. Further, we propose the Ensembling module that aggregates the computation of iterative TAP-Mahalanobis for a different number of clusters to provide reliable and efficient cluster computation. Through extensive experiments on synthetic and real-world datasets, we observe that our approach can detect unseen samples effectively across diverse tasks and performs better or on-par with the existing baselines. To this end, we eliminate the necessity of determining the optimal value of the number of clusters and demonstrate that our method is more viable for large-scale classification tasks.

LGFeb 27, 2023
Deep Imbalanced Time-series Forecasting via Local Discrepancy Density

Junwoo Park, Jungsoo Lee, Youngin Cho et al.

Time-series forecasting models often encounter abrupt changes in a given period of time which generally occur due to unexpected or unknown events. Despite their scarce occurrences in the training set, abrupt changes incur loss that significantly contributes to the total loss. Therefore, they act as noisy training samples and prevent the model from learning generalizable patterns, namely the normal states. Based on our findings, we propose a reweighting framework that down-weights the losses incurred by abrupt changes and up-weights those by normal states. For the reweighting framework, we first define a measurement termed Local Discrepancy (LD) which measures the degree of abruptness of a change in a given period of time. Since a training set is mostly composed of normal states, we then consider how frequently the temporal changes appear in the training set based on LD. Our reweighting framework is applicable to existing time-series forecasting models regardless of the architectures. Through extensive experiments on 12 time-series forecasting models over eight datasets with various in-output sequence lengths, we demonstrate that applying our reweighting framework reduces MSE by 10.1% on average and by up to 18.6% in the state-of-the-art model.

LGJul 20, 2022
GenHPF: General Healthcare Predictive Framework with Multi-task Multi-source Learning

Kyunghoon Hur, Jungwoo Oh, Junu Kim et al.

Despite the remarkable progress in the development of predictive models for healthcare, applying these algorithms on a large scale has been challenging. Algorithms trained on a particular task, based on specific data formats available in a set of medical records, tend to not generalize well to other tasks or databases in which the data fields may differ. To address this challenge, we propose General Healthcare Predictive Framework (GenHPF), which is applicable to any EHR with minimal preprocessing for multiple prediction tasks. GenHPF resolves heterogeneity in medical codes and schemas by converting EHRs into a hierarchical textual representation while incorporating as many features as possible. To evaluate the efficacy of GenHPF, we conduct multi-task learning experiments with single-source and multi-source settings, on three publicly available EHR datasets with different schemas for 12 clinically meaningful prediction tasks. Our framework significantly outperforms baseline models that utilize domain knowledge in multi-source learning, improving average AUROC by 1.2%P in pooled learning and 2.6%P in transfer learning while also showing comparable results when trained on a single EHR dataset. Furthermore, we demonstrate that self-supervised pretraining using multi-source datasets is effective when combined with GenHPF, resulting in a 0.6%P AUROC improvement compared to models without pretraining. By eliminating the need for preprocessing and feature engineering, we believe that this work offers a solid framework for multi-task and multi-source learning that can be leveraged to speed up the scaling and usage of predictive algorithms in healthcare.

LGJul 7, 2022
Towards the Practical Utility of Federated Learning in the Medical Domain

Seongjun Yang, Hyeonji Hwang, Daeyoung Kim et al.

Federated learning (FL) is an active area of research. One of the most suitable areas for adopting FL is the medical domain, where patient privacy must be respected. Previous research, however, does not provide a practical guide to applying FL in the medical domain. We propose empirical benchmarks and experimental settings for three representative medical datasets with different modalities: longitudinal electronic health records, skin cancer images, and electrocardiogram signals. The likely users of FL such as medical institutions and IT companies can take these benchmarks as guides for adopting FL and minimize their trial and error. For each dataset, each client data is from a different source to preserve real-world heterogeneity. We evaluate six FL algorithms designed for addressing data heterogeneity among clients, and a hybrid algorithm combining the strengths of two representative FL algorithms. Based on experiment results from three modalities, we discover that simple FL algorithms tend to outperform more sophisticated ones, while the hybrid algorithm consistently shows good, if not the best performance. We also find that a frequent global model update leads to better performance under a fixed training iteration budget. As the number of participating clients increases, higher cost is incurred due to increased IT administrators and GPUs, but the performance consistently increases. We expect future users will refer to these empirical benchmarks to design the FL experiments in the medical domain considering their clinical tasks and obtain stronger performance with lower costs.

LGAug 7, 2024Code
Time is Not Enough: Time-Frequency based Explanation for Time-Series Black-Box Models

Hyunseung Chung, Sumin Jo, Yeonsu Kwon et al.

Despite the massive attention given to time-series explanations due to their extensive applications, a notable limitation in existing approaches is their primary reliance on the time-domain. This overlooks the inherent characteristic of time-series data containing both time and frequency features. In this work, we present Spectral eXplanation (SpectralX), an XAI framework that provides time-frequency explanations for time-series black-box classifiers. This easily adaptable framework enables users to "plug-in" various perturbation-based XAI methods for any pre-trained time-series classification models to assess their impact on the explanation quality without having to modify the framework architecture. Additionally, we introduce Feature Importance Approximations (FIA), a new perturbation-based XAI method. These methods consist of feature insertion, deletion, and combination techniques to enhance computational efficiency and class-specific explanations in time-series classification tasks. We conduct extensive experiments in the generated synthetic dataset and various UCR Time-Series datasets to first compare the explanation performance of FIA and other existing perturbation-based XAI methods in both time-domain and time-frequency domain, and then show the superiority of our FIA in the time-frequency domain with the SpectralX framework. Finally, we conduct a user study to confirm the practicality of our FIA in SpectralX framework for class-specific time-frequency based time-series explanations. The source code is available in https://github.com/gustmd0121/Time_is_not_Enough

CLOct 17, 2023
KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using Large Language Models

Jiho Kim, Yeonsu Kwon, Yohan Jo et al.

While large language models (LLMs) have made considerable advancements in understanding and generating unstructured text, their application in structured data remains underexplored. Particularly, using LLMs for complex reasoning tasks on knowledge graphs (KGs) remains largely untouched. To address this, we propose KG-GPT, a multi-purpose framework leveraging LLMs for tasks employing KGs. KG-GPT comprises three steps: Sentence Segmentation, Graph Retrieval, and Inference, each aimed at partitioning sentences, retrieving relevant graph components, and deriving logical conclusions, respectively. We evaluate KG-GPT using KG-based fact verification and KGQA benchmarks, with the model showing competitive and robust performance, even outperforming several fully-supervised models. Our work, therefore, marks a significant step in unifying structured and unstructured data processing within the realm of LLMs.

IVFeb 23, 2023
Vision-Language Generative Model for View-Specific Chest X-ray Generation

Hyungyung Lee, Da Young Lee, Wonjae Kim et al.

Synthetic medical data generation has opened up new possibilities in the healthcare domain, offering a powerful tool for simulating clinical scenarios, enhancing diagnostic and treatment quality, gaining granular medical knowledge, and accelerating the development of unbiased algorithms. In this context, we present a novel approach called ViewXGen, designed to overcome the limitations of existing methods that rely on general domain pipelines using only radiology reports to generate frontal-view chest X-rays. Our approach takes into consideration the diverse view positions found in the dataset, enabling the generation of chest X-rays with specific views, which marks a significant advancement in the field. To achieve this, we introduce a set of specially designed tokens for each view position, tailoring the generation process to the user's preferences. Furthermore, we leverage multi-view chest X-rays as input, incorporating valuable information from different views within the same study. This integration rectifies potential errors and contributes to faithfully capturing abnormal findings in chest X-ray generation. To validate the effectiveness of our approach, we conducted statistical analyses, evaluating its performance in a clinical efficacy metric on the MIMIC-CXR dataset. Also, human evaluation demonstrates the remarkable capabilities of ViewXGen, particularly in producing realistic view-specific X-rays that closely resemble the original images.

IVSep 11, 2024Code
Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Daeun Kyung, Junu Kim, Tackeun Kim et al.

Chest X-ray (CXR) is an important diagnostic tool widely used in hospitals to assess patient conditions and monitor changes over time. Recently, generative models, specifically diffusion-based models, have shown promise in generating realistic synthetic CXRs. However, these models mainly focus on conditional generation using single-time-point data, i.e., generating CXRs conditioned on their corresponding reports from a specific time. This limits their clinical utility, particularly for capturing temporal changes. To address this limitation, we propose a novel framework, EHRXDiff, which predicts future CXR images by integrating previous CXRs with subsequent medical events, e.g., prescriptions, lab measures, etc. Our framework dynamically tracks and predicts disease progression based on a latent diffusion model, conditioned on the previous CXR image and a history of medical events. We comprehensively evaluate the performance of our framework across three key aspects, including clinical consistency, demographic consistency, and visual realism. Results show that our framework generates high-quality, realistic future images that effectively capture potential temporal changes. This suggests that our framework could be further developed to support clinical decision-making and provide valuable insights for patient monitoring and treatment planning in the medical field. The code is available at https://github.com/dek924/EHRXDiff.

CLOct 24, 2022
Specializing Multi-domain NMT via Penalizing Low Mutual Information

Jiyoung Lee, Hantae Kim, Hyunchang Cho et al.

Multi-domain Neural Machine Translation (NMT) trains a single model with multiple domains. It is appealing because of its efficacy in handling multiple domains within one model. An ideal multi-domain NMT should learn distinctive domain characteristics simultaneously, however, grasping the domain peculiarity is a non-trivial task. In this paper, we investigate domain-specific information through the lens of mutual information (MI) and propose a new objective that penalizes low MI to become higher. Our method achieved the state-of-the-art performance among the current competitive multi-domain NMT models. Also, we empirically show our objective promotes low MI to be higher resulting in domain-specialized multi-domain NMT.

CLMar 14, 2022
Uncertainty-Aware Text-to-Program for Question Answering on Structured Electronic Health Records

Daeyoung Kim, Seongsu Bae, Seungho Kim et al.

Question Answering on Electronic Health Records (EHR-QA) has a significant impact on the healthcare domain, and it is being actively studied. Previous research on structured EHR-QA focuses on converting natural language queries into query language such as SQL or SPARQL (NLQ2Query), so the problem scope is limited to pre-defined data types by the specific query language. In order to expand the EHR-QA task beyond this limitation to handle multi-modal medical data and solve complex inference in the future, more primitive systemic language is needed. In this paper, we design the program-based model (NLQ2Program) for EHR-QA as the first step towards the future direction. We tackle MIMICSPARQL*, the graph-based EHR-QA dataset, via a program-based approach in a semi-supervised manner in order to overcome the absence of gold programs. Without the gold program, our proposed model shows comparable performance to the previous state-of-the-art model, which is an NLQ2Query model (0.9% gain). In addition, for a reliable EHR-QA model, we apply the uncertainty decomposition method to measure the ambiguity in the input question. We empirically confirmed data uncertainty is most indicative of the ambiguity in the input question.

CLOct 23, 2022
Do Language Models Understand Measurements?

Sungjin Park, Seungwoo Ryu, Edward Choi

Recent success of pre-trained language models (PLMs) has stimulated interest in their ability to understand and work with numbers. Yet, the numerical reasoning over measurements has not been formally studied despite their importance. In this study, we show that PLMs lack the capability required for reasoning over measurements. Furthermore, we find that a language model trained on a measurement-rich corpus shows better performance on understanding measurements. We propose a simple embedding strategy to better distinguish between numbers and units, which leads to a significant improvement in the probing tasks.

73.1CLMay 26
Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

Yeonsu Kwon, Jiho Kim, Junseong Choi et al.

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.

LGNov 14, 2022
Universal EHR Federated Learning Framework

Junu Kim, Kyunghoon Hur, Seongjun Yang et al.

Federated learning (FL) is the most practical multi-source learning method for electronic healthcare records (EHR). Despite its guarantee of privacy protection, the wide application of FL is restricted by two large challenges: the heterogeneous EHR systems, and the non-i.i.d. data characteristic. A recent research proposed a framework that unifies heterogeneous EHRs, named UniHPF. We attempt to address both the challenges simultaneously by combining UniHPF and FL. Our study is the first approach to unify heterogeneous EHRs into a single FL framework. This combination provides an average of 3.4% performance gain compared to local learning. We believe that our framework is practically applicable in the real-world FL.

SPAug 8, 2022
Automatic Detection of Noisy Electrocardiogram Signals without Explicit Noise Labels

Radhika Dua, Jiyoung Lee, Joon-myoung Kwon et al.

Electrocardiogram (ECG) signals are beneficial in diagnosing cardiovascular diseases, which are one of the leading causes of death. However, they are often contaminated by noise artifacts and affect the automatic and manual diagnosis process. Automatic deep learning-based examination of ECG signals can lead to inaccurate diagnosis, and manual analysis involves rejection of noisy ECG samples by clinicians, which might cost extra time. To address this limitation, we present a two-stage deep learning-based framework to automatically detect the noisy ECG samples. Through extensive experiments and analysis on two different datasets, we observe that the deep learning-based framework can detect slightly and highly noisy ECG samples effectively. We also study the transfer of the model learned on one dataset to another dataset and observe that the framework effectively detects noisy ECG samples.

LGMar 15, 2023
Rediscovery of CNN's Versatility for Text-based Encoding of Raw Electronic Health Records

Eunbyeol Cho, Min Jae Lee, Kyunghoon Hur et al.

Making the most use of abundant information in electronic health records (EHR) is rapidly becoming an important topic in the medical domain. Recent work presented a promising framework that embeds entire features in raw EHR data regardless of its form and medical code standards. The framework, however, only focuses on encoding EHR with minimal preprocessing and fails to consider how to learn efficient EHR representation in terms of computation and memory usage. In this paper, we search for a versatile encoder not only reducing the large data into a manageable size but also well preserving the core information of patients to perform diverse clinical tasks. We found that hierarchically structured Convolutional Neural Network (CNN) often outperforms the state-of-the-art model on diverse tasks such as reconstruction, prediction, and generation, even with fewer parameters and less training time. Moreover, it turns out that making use of the inherent hierarchy of EHR data can boost the performance of any kind of backbone models and clinical tasks performed. Through extensive experiments, we present concrete evidence to generalize our research findings into real-world practice. We give a clear guideline on building the encoder based on the research findings captured while exploring numerous settings.

70.2LGMar 15Code
ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Jungwoo Oh, Hyunseung Chung, Junhee Lee et al.

While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform actual step-by-step reasoning or just rely on superficial visual cues. To investigate this, we introduce \textbf{ECG-Reasoning-Benchmark}, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning-centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg-reasoning-benchmark.

CLFeb 21, 2024Code
KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge

Jiyoung Lee, Minwoo Kim, Seungho Kim et al.

For Large Language Models (LLMs) to be effectively deployed in a specific country, they must possess an understanding of the nation's culture and basic knowledge. To this end, we introduce National Alignment, which measures an alignment between an LLM and a targeted country from two aspects: social value alignment and common knowledge alignment. Social value alignment evaluates how well the model understands nation-specific social values, while common knowledge alignment examines how well the model captures basic knowledge related to the nation. We constructed KorNAT, the first benchmark that measures national alignment with South Korea. For the social value dataset, we obtained ground truth labels from a large-scale survey involving 6,174 unique Korean participants. For the common knowledge dataset, we constructed samples based on Korean textbooks and GED reference materials. KorNAT contains 4K and 6K multiple-choice questions for social value and common knowledge, respectively. Our dataset creation process is meticulously designed and based on statistical sampling theory and was refined through multiple rounds of human review. The experiment results of seven LLMs reveal that only a few models met our reference score, indicating a potential for further enhancement. KorNAT has received government approval after passing an assessment conducted by a government-affiliated organization dedicated to evaluating dataset quality. Samples and detailed evaluation protocols of our dataset can be found in https://huggingface.co/datasets/jiyounglee0523/KorNAT .

CLDec 20, 2024Code
Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Sungjin Park, Xiao Liu, Yeyun Gong et al.

Despite recent advances in large language models, open-source models often struggle to consistently perform well on complex reasoning tasks. Existing ensemble methods, whether applied at the token or output levels, fail to address these challenges. In response, we present Language model Ensemble with Monte Carlo Tree Search (LE-MCTS), a novel framework for process-level ensembling of language models. LE-MCTS formulates step-by-step reasoning with an ensemble of language models as a Markov decision process. In this framework, states represent intermediate reasoning paths, while actions consist of generating the next reasoning step using one of the language models selected from a predefined pool. Guided by a process-based reward model, LE-MCTS performs a tree search over the reasoning steps generated by different language models, identifying the most accurate reasoning chain. Experimental results on five mathematical reasoning benchmarks demonstrate that our approach outperforms both single language model decoding algorithms and language model ensemble methods. Notably, LE-MCTS improves performance by 3.6% and 4.3% on the MATH and MQA datasets, respectively, highlighting its effectiveness in solving complex reasoning problems.

LGNov 15, 2022
UniHPF : Universal Healthcare Predictive Framework with Zero Domain Knowledge

Kyunghoon Hur, Jungwoo Oh, Junu Kim et al.

Despite the abundance of Electronic Healthcare Records (EHR), its heterogeneity restricts the utilization of medical data in building predictive models. To address this challenge, we propose Universal Healthcare Predictive Framework (UniHPF), which requires no medical domain knowledge and minimal pre-processing for multiple prediction tasks. Experimental results demonstrate that UniHPF is capable of building large-scale EHR models that can process any form of medical data from distinct EHR systems. We believe that our findings can provide helpful insights for further research on the multi-source learning of EHRs.

CLMar 3, 2024Code
KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations

Sunjun Kweon, Byungjin Choi, Gyouk Chu et al.

We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark, derived from professional healthcare licensing examinations conducted in Korea between 2012 and 2024. The dataset contains 7,469 questions from examinations for doctor, nurse, pharmacist, and dentist, covering a wide range of medical disciplines. We evaluate the performance of 59 large language models, spanning proprietary and open-source models, multilingual and Korean-specialized models, and those fine-tuned for clinical applications. Our results show that applying Chain of Thought (CoT) reasoning can enhance the model performance by up to 4.5% compared to direct answering approaches. We also investigate whether MedQA, one of the most widely used medical benchmarks derived from the U.S. Medical Licensing Examination, can serve as a reliable proxy for evaluating model performance in other regions-in this case, Korea. Our correlation analysis between model scores on KorMedMCQA and MedQA reveals that these two benchmarks align no better than benchmarks from entirely different domains (e.g., MedQA and MMLU-Pro). This finding underscores the substantial linguistic and clinical differences between Korean and U.S. medical contexts, reinforcing the need for region-specific medical QA benchmarks. To support ongoing research in Korean healthcare AI, we publicly release the KorMedMCQA via Huggingface.

90.5CLMay 20
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski et al.

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

LGFeb 3, 2024Code
Self-Supervised Contrastive Learning for Long-term Forecasting

Junwoo Park, Daehoon Gwak, Jaegul Choo et al.

Long-term forecasting presents unique challenges due to the time and memory complexity of handling long sequences. Existing methods, which rely on sliding windows to process long sequences, struggle to effectively capture long-term variations that are partially caught within the short window (i.e., outer-window variations). In this paper, we introduce a novel approach that overcomes this limitation by employing contrastive learning and enhanced decomposition architecture, specifically designed to focus on long-term variations. To this end, our contrastive loss incorporates global autocorrelation held in the whole time series, which facilitates the construction of positive and negative pairs in a self-supervised manner. When combined with our decomposition networks, our contrastive learning significantly improves long-term forecasting performance. Extensive experiments demonstrate that our approach outperforms 14 baseline models in multiple experiments over nine long-term benchmarks, especially in challenging scenarios that require a significantly long output for forecasting. Source code is available at https://github.com/junwoopark92/Self-Supervised-Contrastive-Forecsating.

LGOct 29, 2022
Self-Supervised Predictive Coding with Multimodal Fusion for Patient Deterioration Prediction in Fine-grained Time Resolution

Kwanhyung Lee, John Won, Heejung Hyun et al.

Accurate time prediction of patients' critical events is crucial in urgent scenarios where timely decision-making is important. Though many studies have proposed automatic prediction methods using Electronic Health Records (EHR), their coarse-grained time resolutions limit their practical usage in urgent environments such as the emergency department (ED) and intensive care unit (ICU). Therefore, in this study, we propose an hourly prediction method based on self-supervised predictive coding and multi-modal fusion for two critical tasks: mortality and vasopressor need prediction. Through extensive experiments, we prove significant performance gains from both multi-modal fusion and self-supervised predictive regularization, most notably in far-future prediction, which becomes especially important in practice. Our uni-modal/bi-modal/bi-modal self-supervision scored 0.846/0.877/0.897 (0.824/0.855/0.886) and 0.817/0.820/0.858 (0.807/0.81/0.855) with mortality (far-future mortality) and with vasopressor need (far-future vasopressor need) prediction data in AUROC, respectively.

CLMay 23, 2024Code
EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records

Jaehee Ryu, Seonhee Cho, Gyubok Lee et al.

In this paper, we introduce EHR-SeqSQL, a novel sequential text-to-SQL dataset for Electronic Health Record (EHR) databases. EHR-SeqSQL is designed to address critical yet underexplored aspects in text-to-SQL parsing: interactivity, compositionality, and efficiency. To the best of our knowledge, EHR-SeqSQL is not only the largest but also the first medical text-to-SQL dataset benchmark to include sequential and contextual questions. We provide a data split and the new test set designed to assess compositional generalization ability. Our experiments demonstrate the superiority of a multi-turn approach over a single-turn approach in learning compositionality. Additionally, our dataset integrates specially crafted tokens into SQL queries to improve execution efficiency. With EHR-SeqSQL, we aim to bridge the gap between practical needs and academic research in the text-to-SQL domain. EHR-SeqSQL is available at https://github.com/seonhee99/EHR-SeqSQL.

AIMar 23, 2024Code
TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring

Gyubok Lee, Woosog Chay, Seonhee Cho et al.

Text-to-SQL enables users to interact with databases using natural language, simplifying the retrieval and synthesis of information. Despite the remarkable success of large language models (LLMs) in translating natural language questions into SQL queries, widespread deployment remains limited due to two primary challenges. First, the effective use of text-to-SQL models depends on users' understanding of the model's capabilities-the scope of questions the model can correctly answer. Second, the absence of abstention mechanisms can lead to incorrect SQL generation going unnoticed, thereby undermining trust in the model's output. To enable wider deployment, it is crucial to address these challenges in model design and enhance model evaluation to build trust in the model's output. To this end, we introduce TrustSQL, a novel comprehensive benchmark designed to evaluate text-to-SQL reliability-defined as a model's ability to correctly handle any type of input question by generating correct SQL queries for feasible questions and abstaining from generating infeasible ones (e.g., due to schema incompatibility or functionalities beyond SQL). We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches: (1) pipeline-based methods combining SQL generators with infeasible question detectors and SQL error detectors for abstention; and (2) unified methods using a single model for the entire task. Our experimental results reveal that achieving high scores under severe penalties requires significant effort and provide a new perspective on developing text-to-SQL models for safer deployment. TrustSQL is available at https://github.com/glee4810/TrustSQL.

LGOct 31, 2023
General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History

Junu Kim, Chaeeun Shim, Bosco Seong Kyu Yang et al.

Machine learning (ML) has recently shown promising results in medical predictions using electronic health records (EHRs). However, since ML models typically have a limited capability in terms of input sizes, selecting specific medical events from EHRs for use as input is necessary. This selection process, often relying on expert opinion, can cause bottlenecks in development. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate unlimited medical events, select the relevant ones, and make predictions. This allows for an unrestricted input size, eliminating the need for manual event selection. We verified these properties through experiments involving 27 clinical prediction tasks across four independent cohorts, where REMed outperformed the baselines. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.

AIJan 28
ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue

Hyunseung Chung, Jungwoo Oh, Daeun Kyung et al.

Recent advances in Multimodal Large Language Models have rapidly expanded to electrocardiograms, focusing on classification, report generation, and single-turn QA tasks. However, these models fall short in real-world scenarios, lacking multi-turn conversational ability, on-device efficiency, and precise understanding of ECG measurements such as the PQRST intervals. To address these limitations, we introduce ECG-Agent, the first LLM-based tool-calling agent for multi-turn ECG dialogue. To facilitate its development and evaluation, we also present ECG-Multi-Turn-Dialogue (ECG-MTD) dataset, a collection of realistic user-assistant multi-turn dialogues for diverse ECG lead configurations. We develop ECG-Agents in various sizes, from on-device capable to larger agents. Experimental results show that ECG-Agents outperform baseline ECG-LLMs in response accuracy. Furthermore, on-device agents achieve comparable performance to larger agents in various evaluations that assess response accuracy, tool-calling ability, and hallucinations, demonstrating their viability for real-world applications.

CLFeb 18, 2025Code
R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Sumin Jo, Junseong Choi, Jiho Kim et al.

Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability with reduced inference cost but increased abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at https://github.com/ekrxjwh2009/R2-KG/.

CLMay 27, 2025Code
Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon, Geon Choi, Paloma Rabaey et al.

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE,a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 80 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage

AIMay 23, 2025Code
PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Daeun Kyung, Hyunseung Chung, Seongsu Bae et al.

Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluate eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare. The code is available at https://github.com/dek924/PatientSim.

CLSep 12, 2025Code
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Gyubok Lee, Elea Bach, Eric Yang et al.

The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.

CYJun 20, 2025Code
A Large-Scale Real-World Evaluation of LLM-Based Virtual Teaching Assistant

Sunjun Kweon, Sooyohn Nam, Hyunseung Lim et al.

Virtual Teaching Assistants (VTAs) powered by Large Language Models (LLMs) have the potential to enhance student learning by providing instant feedback and facilitating multi-turn interactions. However, empirical studies on their effectiveness and acceptance in real-world classrooms are limited, leaving their practical impact uncertain. In this study, we develop an LLM-based VTA and deploy it in an introductory AI programming course with 477 graduate students. To assess how student perceptions of the VTA's performance evolve over time, we conduct three rounds of comprehensive surveys at different stages of the course. Additionally, we analyze 3,869 student--VTA interaction pairs to identify common question types and engagement patterns. We then compare these interactions with traditional student--human instructor interactions to evaluate the VTA's role in the learning process. Through a large-scale empirical study and interaction analysis, we assess the feasibility of deploying VTAs in real-world classrooms and identify key challenges for broader adoption. Finally, we release the source code of our VTA system, fostering future advancements in AI-driven education: \texttt{https://github.com/sean0042/VTA}.

CLMay 27, 2025Code
Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

Jiyoung Lee, Seungho Kim, Jieun Han et al.

Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code and datasets are publicly available at https://github.com/jiyounglee-0523/TransEnV and https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1.

CVMay 23, 2025Code
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee, Geon Choi, Jung-Oh Lee et al.

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 12 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench

AIFeb 26
CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Hyungyung Lee, Hangyul Yoon, Edward Choi

Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.

LGJul 9, 2025Code
Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing

Eunbyeol Cho, Jiyoun Kim, Minjae Lee et al.

Electronic Health Records (EHR) are time-series relational databases that record patient interactions and medical events over time, serving as a critical resource for healthcare research and applications. However, privacy concerns and regulatory restrictions limit the sharing and utilization of such sensitive data, necessitating the generation of synthetic EHR datasets. Unlike previous EHR synthesis methods, which typically generate medical records consisting of expert-chosen features (e.g. a few vital signs or structured codes only), we introduce RawMed, the first framework to synthesize multi-table, time-series EHR data that closely resembles raw EHRs. Using text-based representation and compression techniques, RawMed captures complex structures and temporal dynamics with minimal preprocessing. We also propose a new evaluation framework for multi-table time-series synthetic EHRs, assessing distributional similarity, inter-table relationships, temporal dynamics, and privacy. Validated on two open-source EHR datasets, RawMed outperforms baseline models in fidelity and utility. The code is available at https://github.com/eunbyeol-cho/RawMed.