CLSep 28, 2023Code
Qwen Technical ReportJinze Bai, Shuai Bai, Yunfei Chu et al. · pku, tsinghua
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.
CLApr 11, 2023Code
RRHF: Rank Responses to Align Language Models with Human Feedback without tearsZheng Yuan, Hongyi Yuan, Chuanqi Tan et al. · tsinghua
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities and learns to align these probabilities with human preferences through ranking loss. RRHF can leverage sampled responses from various sources including the model responses from itself, other large language model responses, and human expert responses to learn to rank them. RRHF only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality which suggests RRHF is a best-of-n learner. Codes available at https://github.com/GanjinZero/RRHF.
CLMar 16, 2023Code
How well do Large Language Models perform in Arithmetic tasks?Zheng Yuan, Hongyi Yuan, Chuanqi Tan et al. · tsinghua
Large language models have emerged abilities including chain-of-thought to answer math word problems step by step. Solving math word problems not only requires abilities to disassemble problems via chain-of-thought but also needs to calculate arithmetic expressions correctly for each step. To the best of our knowledge, there is no work to focus on evaluating the arithmetic ability of large language models. In this work, we propose an arithmetic dataset MATH 401 to test the latest large language models including GPT-4, ChatGPT, InstrctGPT, Galactica, and LLaMA with various arithmetic expressions and provide a detailed analysis of the ability of large language models. MATH 401 and evaluation codes are released at \url{https://github.com/GanjinZero/math401-llm}.
CLAug 14, 2023Code
#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language ModelsKeming Lu, Hongyi Yuan, Zheng Yuan et al. · tsinghua
Foundation language models obtain the instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, while their definitions remain obscure and lack quantitative analyses. In this work, we propose InsTag, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions and define instruction diversity and complexity regarding tags. We obtain 6.6K tags to describe comprehensive user queries. Then we analyze popular open-sourced SFT datasets and find that the model ability grows with more diverse and complex data. Based on this observation, we propose a data selector based on InsTag to select 6K diverse and complex samples from open-source datasets and fine-tune models on InsTag-selected data. The resulting models, TagLM, outperform open-source models based on considerably larger SFT data evaluated by MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag in https://github.com/OFA-Sys/InsTag.
CLOct 9, 2023Code
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data CompositionGuanting Dong, Hongyi Yuan, Keming Lu et al. · tsinghua
Large language models (LLMs) with enormous pre-training tokens and parameters emerge diverse abilities, including math reasoning, code generation, and instruction following. These abilities are further enhanced by supervised fine-tuning (SFT). While the open-source community has explored ad-hoc SFT for enhancing individual capabilities, proprietary LLMs exhibit versatility across various skills. Therefore, understanding the facilitation of multiple abilities via SFT is paramount. In this study, we specifically focuses on the interplay of data composition between mathematical reasoning, code generation, and general human-aligning abilities during SFT. We propose four intriguing research questions to explore the association between model performance and various factors including data amount, composition ratio, model size and SFT strategies. Our experiments reveal that distinct capabilities scale differently and larger models generally show superior performance with same amount of data. Mathematical reasoning and code generation consistently improve with increasing data amount, whereas general abilities plateau after roughly a thousand samples. Moreover, we observe data composition appears to enhance various abilities under limited data conditions, yet can lead to performance conflicts when data is plentiful. Our findings also suggest the amount of composition data influences performance more than the composition ratio. In analysis of SFT strategies, we find that sequentially learning multiple skills risks catastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT) strategy offers a promising solution to learn multiple abilities with different scaling patterns.
CVMar 1, 2023Code
RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-trainingZheng Yuan, Qiao Jin, Chuanqi Tan et al. · tsinghua
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representation for image-text pairs and align these representations with image-text contrastive objective (ITC). Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a novel retrieval-attention module to fuse the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM can enhance biomedical VQA performance compared with previous resources and methods. We will open-source our dataset, codes, and pretrained model.
CLOct 9, 2023Code
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math ReasoningChengpeng Li, Zheng Yuan, Hongyi Yuan et al. · tsinghua
In math reasoning with large language models (LLMs), fine-tuning data augmentation by query evolution and diverse reasoning paths is empirically verified effective, profoundly narrowing the gap between open-sourced LLMs and cutting-edge proprietary LLMs. In this paper, we conduct an investigation for such data augmentation in math reasoning and are intended to answer: (1) What strategies of data augmentation are more effective; (2) What is the scaling relationship between the amount of augmented data and model performance; and (3) Can data augmentation incentivize generalization to out-of-domain mathematical reasoning tasks? To this end, we create two new dataset AugGSM8K and AugMATH, by complicating and diversifying the queries and sampling multiple reasoning paths from GSM8K and MATH. We obtained a series of LLMs called MuggleMath by fine-tuning LLaMA models on AugGSM8K and AugMATH. MuggleMath substantially achieves new state-of-the-art on GSM8K and MATH. A log-linear relationship and a segmented log-linear are presented between MuggleMath's performance and the amount of augmented data on GSM8K and MATH, respectively. We also find that it is weak in out-of-domain math reasoning generalization from AugGSM8K to MATH and from AugMATH to GSM8K, which suggests that augmenting queries that cover a broader range of subjects is more beneficial for generalization. We release our codes and augmented data in https://github.com/OFA-Sys/gsm8k-ScRel.
CLAug 3, 2023
Scaling Relationship on Learning Mathematical Reasoning with Large Language ModelsZheng Yuan, Hongyi Yuan, Chengpeng Li et al. · tsinghua
Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance, and we find better models improve less with enlarged supervised datasets. To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3\% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9\% significantly.
CLDec 20, 2022
SeqDiffuSeq: Text Diffusion with Encoder-Decoder TransformersHongyi Yuan, Zheng Yuan, Chuanqi Tan et al. · tsinghua
Diffusion model, a new generative modelling paradigm, has achieved great success in image, audio, and video generation. However, considering the discrete categorical nature of text, it is not trivial to extend continuous diffusion models to natural language, and text diffusion models are less studied. Sequence-to-sequence text generation is one of the essential natural language processing topics. In this work, we apply diffusion models to approach sequence-to-sequence text generation, and explore whether the superiority generation performance of diffusion model can transfer to natural language domain. We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation. SeqDiffuSeq uses an encoder-decoder Transformers architecture to model denoising function. In order to improve generation quality, SeqDiffuSeq combines the self-conditioning technique and a newly proposed adaptive noise schedule technique. The adaptive noise schedule has the difficulty of denoising evenly distributed across time steps, and considers exclusive noise schedules for tokens at different positional order. Experiment results illustrate the good performance on sequence-to-sequence generation in terms of text quality and inference time.
CLMar 18, 2023Code
Exploring Partial Knowledge Base Inference in Biomedical Entity LinkingHongyi Yuan, Keming Lu, Zheng Yuan · tsinghua
Biomedical entity linking (EL) consists of named entity recognition (NER) and named entity disambiguation (NED). EL models are trained on corpora labeled by a predefined KB. However, it is a common scenario that only entities within a subset of the KB are precious to stakeholders. We name this scenario partial knowledge base inference: training an EL model with one KB and inferring on the part of it without further training. In this work, we give a detailed definition and evaluation procedures for this practically valuable but significantly understudied scenario and evaluate methods from three representative EL paradigms. We construct partial KB inference benchmarks and witness a catastrophic degradation in EL performance due to dramatically precision drop. Our findings reveal these EL paradigms can not correctly handle unlinkable mentions (NIL), so they are not robust to partial KB inference. We also propose two simple-and-effective redemption methods to combat the NIL issue with little computational overhead. Codes are released at https://github.com/Yuanhy1997/PartialKB-EL.
CLNov 15, 2023
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language ModelsKeming Lu, Hongyi Yuan, Runji Lin et al. · tsinghua
The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complementary potential of LLMs and further elaborate it by mining latent expertise with off-the-shelf reward models. We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise about it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. Zooter shows computation efficiency in inference as it introduces only a minor computation overhead of a routing function compared with reward model ranking methods. We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks. Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods.
CLApr 8, 2022
BioBART: Pretraining and Evaluation of A Biomedical Generative Language ModelHongyi Yuan, Zheng Yuan, Ruyi Gan et al. · tsinghua
Pretrained language models have served as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are of critical importance, while understudied. Approaching natural language understanding (NLU) tasks as NLG achieves satisfying performance in the general domain through constrained language generation or language prompting. We emphasize the lack of in-domain generative language models and the unsystematic generative downstream benchmarks in the biomedical domain, hindering the development of the research community. In this work, we introduce the generative language model BioBART that adapts BART to the biomedical domain. We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition. BioBART pretrained on PubMed abstracts has enhanced performance compared to BART and set strong baselines on several tasks. Furthermore, we conduct ablation studies on the pretraining tasks for BioBART and find that sentence permutation has negative effects on downstream tasks.
CLDec 17, 2022
HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation PerturbationHongyi Yuan, Zheng Yuan, Chuanqi Tan et al. · tsinghua
Language models with the Transformers structure have shown great performance in natural language processing. However, there still poses problems when fine-tuning pre-trained language models on downstream tasks, such as over-fitting or representation collapse. In this work, we propose HyPe, a simple yet effective fine-tuning technique to alleviate such problems by perturbing hidden representations of Transformers layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformers layers convey more diverse and meaningful language information. Therefore, making the Transformers layers more robust to hidden representation perturbations can further benefit the fine-tuning of PLMs en bloc. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances generalization of hidden representations from different layers. In addition, HyPe acquires negligible computational overheads, and is better than and compatible with previous state-of-the-art fine-tuning techniques.
CLMar 18, 2022
BIOS: An Algorithmically Generated Biomedical Knowledge GraphSheng Yu, Zheng Yuan, Jun Xia et al. · tsinghua
Biomedical knowledge graphs (BioMedKGs) are essential infrastructures for biomedical and healthcare big data and artificial intelligence (AI), facilitating natural language processing, model development, and data exchange. For decades, these knowledge graphs have been developed via expert curation; however, this method can no longer keep up with today's AI development, and a transition to algorithmically generated BioMedKGs is necessary. In this work, we introduce the Biomedical Informatics Ontology System (BIOS), the first large-scale publicly available BioMedKG generated completely by machine learning algorithms. BIOS currently contains 4.1 million concepts, 7.4 million terms in two languages, and 7.3 million relation triplets. We present the methodology for developing BIOS, including the curation of raw biomedical terms, computational identification of synonymous terms and aggregation of these terms to create concept nodes, semantic type classification of the concepts, relation identification, and biomedical machine translation. We provide statistics on the current BIOS content and perform preliminary assessments of term quality, synonym grouping, and relation extraction. The results suggest that machine learning-based BioMedKG development is a viable alternative to traditional expert curation.
CLApr 11, 2022
Generative Biomedical Entity Linking via Knowledge Base-Guided Pre-training and Synonyms-Aware Fine-tuningHongyi Yuan, Zheng Yuan, Sheng Yu · tsinghua
Entities lie in the heart of biomedical natural language understanding, and the biomedical entity linking (EL) task remains challenging due to the fine-grained and diversiform concept names. Generative methods achieve remarkable performances in general domain EL with less memory usage while requiring expensive pre-training. Previous biomedical EL methods leverage synonyms from knowledge bases (KB) which is not trivial to inject into a generative method. In this work, we use a generative approach to model biomedical EL and propose to inject synonyms knowledge in it. We propose KB-guided pre-training by constructing synthetic samples with synonyms and definitions from KB and require the model to recover concept names. We also propose synonyms-aware fine-tuning to select concept names for training, and propose decoder prompt and multi-synonyms constrained prefix tree for inference. Our method achieves state-of-the-art results on several biomedical EL tasks without candidate selection which displays the effectiveness of proposed pre-training and fine-tuning strategies.
CLNov 15, 2023
Speculative Contrastive DecodingHongyi Yuan, Keming Lu, Fei Huang et al. · tsinghua
Large language models~(LLMs) exhibit exceptional performance in language tasks, yet their auto-regressive inference is limited due to high computational requirements and is sub-optimal due to the exposure bias. Inspired by speculative decoding and contrastive decoding, we introduce Speculative Contrastive Decoding~(SCD), a straightforward yet powerful decoding approach that leverages predictions from smaller language models~(LMs) to achieve both decoding acceleration and quality improvement. Extensive evaluations and analyses on four diverse language tasks demonstrate the effectiveness of SCD, showing that decoding efficiency and quality can compatibly benefit from one smaller LM.
LGMar 10, 2023
EHRDiff: Exploring Realistic EHR Synthesis with Diffusion ModelsHongyi Yuan, Songchi Zhou, Sheng Yu
Electronic health records (EHR) contain a wealth of biomedical information, serving as valuable resources for the development of precision medicine systems. However, privacy concerns have resulted in limited access to high-quality and large-scale EHR data for researchers, impeding progress in methodological development. Recent research has delved into synthesizing realistic EHR data through generative modeling techniques, where a majority of proposed methods relied on generative adversarial networks (GAN) and their variants for EHR synthesis. Despite GAN-based methods attaining state-of-the-art performance in generating EHR data, these approaches are difficult to train and prone to mode collapse. Recently introduced in generative modeling, diffusion models have established cutting-edge performance in image generation, but their efficacy in EHR data synthesis remains largely unexplored. In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes new state-of-the-art quality for synthetic EHR data, protecting private information in the meanwhile.
CLMar 18, 2023
Revisiting Automatic Question Summarization Evaluation in the Biomedical DomainHongyi Yuan, Yaoyun Zhang, Fei Huang et al.
Automatic evaluation metrics have been facilitating the rapid development of automatic summarization methods by providing instant and fair assessments of the quality of summaries. Most metrics have been developed for the general domain, especially news and meeting notes, or other language-generation tasks. However, these metrics are applied to evaluate summarization systems in different domains, such as biomedical question summarization. To better understand whether commonly used evaluation metrics are capable of evaluating automatic summarization in the biomedical domain, we conduct human evaluations of summarization quality from four different aspects of a biomedical question summarization task. Based on human judgments, we identify different noteworthy features for current automatic metrics and summarization systems as well. We also release a dataset of our human annotations to aid the research of summarization evaluation metrics in the biomedical domain.
AIJan 27
Visual Generation Unlocks Human-Like Reasoning through Multimodal World ModelsJialong Wu, Xiaoying Zhang, Hongyi Yuan et al.
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.
CVNov 25, 2025Code
VQ-VA World: Towards High-Quality Visual Question-Visual AnsweringChenhui Gou, Zilong Chen, Zeyu Wang et al.
This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.
AIJul 31, 2025
Seed-Prover: Deep and Broad Reasoning for Automated Theorem ProvingLuoxin Chen, Jinming Gu, Liankai Huang et al. · cmu
LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose \textbf{Seed-Prover}, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves $78.1\%$ of formalized past IMO problems, saturates MiniF2F, and achieves over 50\% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
CLSep 29, 2024
Black-Box Segmentation of Electronic Medical RecordsHongyi Yuan, Sheng Yu
Electronic medical records (EMRs) contain the majority of patients' healthcare details. It is an abundant resource for developing an automatic healthcare system. Most of the natural language processing (NLP) studies on EMR processing, such as concept extraction, are adversely affected by the inaccurate segmentation of EMR sections. At the same time, not enough attention has been given to the accurate sectioning of EMRs. The information that may occur in section structures is unvalued. This work focuses on the segmentation of EMRs and proposes a black-box segmentation method using a simple sentence embedding model and neural network, along with a proper training method. To achieve universal adaptivity, we train our model on the dataset with different section headings formats. We compare several advanced deep learning-based NLP methods, and our method achieves the best segmentation accuracies (above 98%) on various test data with a proper training corpus.
CLJan 30, 2025
GENIE: Generative Note Information Extraction model for structuring EHR dataHuaiyuan Ying, Hongyi Yuan, Jinsen Lu et al.
Electronic Health Records (EHRs) hold immense potential for advancing healthcare, offering rich, longitudinal data that combines structured information with valuable insights from unstructured clinical notes. However, the unstructured nature of clinical text poses significant challenges for secondary applications. Traditional methods for structuring EHR free-text data, such as rule-based systems and multi-stage pipelines, are often limited by their time-consuming configurations and inability to adapt across clinical notes from diverse healthcare settings. Few systems provide a comprehensive attribute extraction for terminologies. While giant large language models (LLMs) like GPT-4 and LLaMA 405B excel at structuring tasks, they are slow, costly, and impractical for large-scale use. To overcome these limitations, we introduce GENIE, a Generative Note Information Extraction system that leverages LLMs to streamline the structuring of unstructured clinical text into usable data with standardized format. GENIE processes entire paragraphs in a single pass, extracting entities, assertion statuses, locations, modifiers, values, and purposes with high accuracy. Its unified, end-to-end approach simplifies workflows, reduces errors, and eliminates the need for extensive manual intervention. Using a robust data preparation pipeline and fine-tuned small scale LLMs, GENIE achieves competitive performance across multiple information extraction tasks, outperforming traditional tools like cTAKES and MetaMap and can handle extra attributes to be extracted. GENIE strongly enhances real-world applicability and scalability in healthcare systems. By open-sourcing the model and test data, we aim to encourage collaboration and drive further advancements in EHR structurization.
IRFeb 10, 2025
CliniQ: A Multi-faceted Benchmark for Electronic Health Record Retrieval with Semantic Match AssessmentZhengyun Zhao, Hongyi Yuan, Jingjing Liu et al.
Electronic Health Record (EHR) retrieval plays a pivotal role in various clinical tasks, but its development has been severely impeded by the lack of publicly available benchmarks. In this paper, we introduce a novel public EHR retrieval benchmark, CliniQ, to address this gap. We consider two retrieval settings: Single-Patient Retrieval and Multi-Patient Retrieval, reflecting various real-world scenarios. Single-Patient Retrieval focuses on finding relevant parts within a patient note, while Multi-Patient Retrieval involves retrieving EHRs from multiple patients. We build our benchmark upon 1,000 discharge summary notes along with the ICD codes and prescription labels from MIMIC-III, and collect 1,246 unique queries with 77,206 relevance judgments by further leveraging powerful LLMs as annotators. Additionally, we include a novel assessment of the semantic gap issue in EHR retrieval by categorizing matching types into string match and four types of semantic matches. On our proposed benchmark, we conduct a comprehensive evaluation of various retrieval methods, ranging from conventional exact match to popular dense retrievers. Our experiments find that BM25 sets a strong baseline and performs competitively to the dense retrievers, and general domain dense retrievers surprisingly outperform those designed for the medical domain. In-depth analyses on various matching types reveal the strengths and drawbacks of different methods, enlightening the potential for targeted improvement. We believe that our benchmark will stimulate the research communities to advance EHR retrieval systems.
LGOct 14, 2024
Unified Representation of Genomic and Biomedical Concepts through Multi-Task, Multi-Source Contrastive LearningHongyi Yuan, Suqi Liu, Kelly Cho et al.
We introduce GENomic Encoding REpresentation with Language Model (GENEREL), a framework designed to bridge genetic and biomedical knowledge bases. What sets GENEREL apart is its ability to fine-tune language models to infuse biological knowledge behind clinical concepts such as diseases and medications. This fine-tuning enables the model to capture complex biomedical relationships more effectively, enriching the understanding of how genomic data connects to clinical outcomes. By constructing a unified embedding space for biomedical concepts and a wide range of common SNPs from sources such as patient-level data, biomedical knowledge graphs, and GWAS summaries, GENEREL aligns the embeddings of SNPs and clinical concepts through multi-task contrastive learning. This allows the model to adapt to diverse natural language representations of biomedical concepts while bypassing the limitations of traditional code mapping systems across different data sources. Our experiments demonstrate GENEREL's ability to effectively capture the nuanced relationships between SNPs and clinical concepts. GENEREL also emerges to discern the degree of relatedness, potentially allowing for a more refined identification of concepts. This pioneering approach in constructing a unified embedding system for both SNPs and biomedical concepts enhances the potential for data integration and discovery in biomedical research.
LGDec 1, 2021
Efficient Symptom Inquiring and Diagnosis via Adaptive Alignment of Reinforcement Learning and ClassificationHongyi Yuan, Sheng Yu
Medical automatic diagnosis aims to imitate human doctors in real-world diagnostic processes and to achieve accurate diagnoses by interacting with the patients. The task is formulated as a sequential decision-making problem with a series of symptom inquiring steps and the final diagnosis. Recent research has studied incorporating reinforcement learning for symptom inquiring and classification techniques for disease diagnosis, respectively. However, studies on efficiently and effectively combining the two procedures are still lacking. To address this issue, we devise an adaptive mechanism to align reinforcement learning and classification methods using distribution entropy as the medium. Additionally, we created a new dataset for patient simulation to address the lacking of large-scale evaluation benchmarks. The dataset is extracted from the MedlinePlus knowledge base and contains significantly more diseases and more comprehensive symptoms and examination information than existing datasets. Experimental evaluation shows that our method outperforms three current state-of-the-art methods on different datasets by achieving higher medical diagnosis accuracy with fewer inquiring turns.