Hong Yu

CL
h-index36
137papers
16,839citations
Novelty47%
AI Score61

137 Papers

CVApr 9, 2022Code
Attention guided global enhancement and local refinement network for semantic segmentation

Jiangyun Li, Sen Zha, Chen Chen et al.

The encoder-decoder architecture is widely used as a lightweight semantic segmentation network. However, it struggles with a limited performance compared to a well-designed Dilated-FCN model for two major problems. First, commonly used upsampling methods in the decoder such as interpolation and deconvolution suffer from a local receptive field, unable to encode global contexts. Second, low-level features may bring noises to the network decoder through skip connections for the inadequacy of semantic concepts in early encoder layers. To tackle these challenges, a Global Enhancement Method is proposed to aggregate global information from high-level feature maps and adaptively distribute them to different decoder layers, alleviating the shortage of global contexts in the upsampling process. Besides, a Local Refinement Module is developed by utilizing the decoder features as the semantic guidance to refine the noisy encoder features before the fusion of these two (the decoder features and the encoder features). Then, the two methods are integrated into a Context Fusion Block, and based on that, a novel Attention guided Global enhancement and Local refinement Network (AGLN) is elaborately designed. Extensive experiments on PASCAL Context, ADE20K, and PASCAL VOC 2012 datasets have demonstrated the effectiveness of the proposed approach. In particular, with a vanilla ResNet-101 backbone, AGLN achieves the state-of-the-art result (56.23% mean IoU) on the PASCAL Context dataset. The code is available at https://github.com/zhasen1996/AGLN.

LGOct 29, 2023
Boosting Decision-Based Black-Box Adversarial Attack with Gradient Priors

Han Liu, Xingshuo Huang, Xiaotong Zhang et al. · apple-ml

Decision-based methods have shown to be effective in black-box adversarial attacks, as they can obtain satisfactory performance and only require to access the final model prediction. Gradient estimation is a critical step in black-box adversarial attacks, as it will directly affect the query efficiency. Recent works have attempted to utilize gradient priors to facilitate score-based methods to obtain better results. However, these gradient priors still suffer from the edge gradient discrepancy issue and the successive iteration gradient direction issue, thus are difficult to simply extend to decision-based methods. In this paper, we propose a novel Decision-based Black-box Attack framework with Gradient Priors (DBA-GP), which seamlessly integrates the data-dependent gradient prior and time-dependent prior into the gradient estimation procedure. First, by leveraging the joint bilateral filter to deal with each random perturbation, DBA-GP can guarantee that the generated perturbations in edge locations are hardly smoothed, i.e., alleviating the edge gradient discrepancy, thus remaining the characteristics of the original image as much as possible. Second, by utilizing a new gradient updating strategy to automatically adjust the successive iteration gradient direction, DBA-GP can accelerate the convergence speed, thus improving the query efficiency. Extensive experiments have demonstrated that the proposed method outperforms other strong baselines significantly.

CLJul 26, 2023
Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong et al.

Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.

CLMay 29, 2022
Learning as Conversation: Dialogue Systems Reinforced for Information Acquisition

Pengshan Cai, Hui Wan, Fei Liu et al. · ibm-research

We propose novel AI-empowered chat bots for learning as conversation where a user does not read a passage but gains information and knowledge through conversation with a teacher bot. Our information-acquisition-oriented dialogue system employs a novel adaptation of reinforced self-play so that the system can be transferred to various domains without in-domain dialogue data, and can carry out conversations both informative and attentive to users. Our extensive subjective and objective evaluations on three large public data corpora demonstrate the effectiveness of our system to deliver knowledge-intensive and attentive conversations and help end users substantially gain knowledge without reading passages. Our code and datasets are publicly available for follow-up research.

SDApr 18, 2022Code
Caption Feature Space Regularization for Audio Captioning

Yiming Zhang, Hong Yu, Ruoyi Du et al.

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to several captions with diverse semantics). For that, general audio captioning models achieve the one-to-many training by randomly selecting a correlated caption as the ground truth for each audio. However, it leads to a significant variation in the optimization directions and weakens the model stability. To eliminate this negative effect, in this paper, we propose a two-stage framework for audio captioning: (i) in the first stage, via the contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to encourage the model to be optimized in the direction that benefits all the correlated captions. We conducted extensive experiments on two datasets using four commonly used encoder and decoder architectures. Experimental results demonstrate the effectiveness of the proposed method. The code is available at https://github.com/PRIS-CV/Caption-Feature-Space-Regularization.

CLDec 11, 2022
Associations Between Natural Language Processing (NLP) Enriched Social Determinants of Health and Suicide Death among US Veterans

Avijit Mitra, Richeek Pradhan, Rachel D Melamed et al.

Importance: Social determinants of health (SDOH) are known to be associated with increased risk of suicidal behaviors, but few studies utilized SDOH from unstructured electronic health record (EHR) notes. Objective: To investigate associations between suicide and recent SDOH, identified using structured and unstructured data. Design: Nested case-control study. Setting: EHR data from the US Veterans Health Administration (VHA). Participants: 6,122,785 Veterans who received care in the US VHA between October 1, 2010, and September 30, 2015. Exposures: Occurrence of SDOH over a maximum span of two years compared with no occurrence of SDOH. Main Outcomes and Measures: Cases of suicide deaths were matched with 4 controls on birth year, cohort entry date, sex, and duration of follow-up. We developed an NLP system to extract SDOH from unstructured notes. Structured data, NLP on unstructured data, and combining them yielded six, eight and nine SDOH respectively. Adjusted odds ratios (aORs) and 95% confidence intervals (CIs) were estimated using conditional logistic regression. Results: In our cohort, 8,821 Veterans committed suicide during 23,725,382 person-years of follow-up (incidence rate 37.18/100,000 person-years). Our cohort was mostly male (92.23%) and white (76.99%). Across the five common SDOH as covariates, NLP-extracted SDOH, on average, covered 80.03% of all SDOH occurrences. All SDOH, measured by structured data and NLP, were significantly associated with increased risk of suicide. The SDOH with the largest effects was legal problems (aOR=2.66, 95% CI=.46-2.89), followed by violence (aOR=2.12, 95% CI=1.98-2.27). NLP-extracted and structured SDOH were also associated with suicide. Conclusions and Relevance: NLP-extracted SDOH were always significantly associated with increased risk of suicide among Veterans, suggesting the potential of NLP in public health studies.

IVMar 29, 2022Code
Category Guided Attention Network for Brain Tumor Segmentation in MRI

Jiangyun Li, Hong Yu, Chen Chen et al.

Objective: Magnetic resonance imaging (MRI) has been widely used for the analysis and diagnosis of brain diseases. Accurate and automatic brain tumor segmentation is of paramount importance for radiation treatment. However, low tissue contrast in tumor regions makes it a challenging task.Approach: We propose a novel segmentation network named Category Guided Attention U-Net (CGA U-Net). In this model, we design a Supervised Attention Module (SAM) based on the attention mechanism, which can capture more accurate and stable long-range dependency in feature maps without introducing much computational cost. Moreover, we propose an intra-class update approach to reconstruct feature maps by aggregating pixels of the same category. Main results: Experimental results on the BraTS 2019 datasets show that the proposed method outperformers the state-of-the-art algorithms in both segmentation performance and computational complexity. Significance: The CGA U-Net can effectively capture the global semantic information in the MRI image by using the SAM module, while significantly reducing the computational cost. Code is available at https://github.com/delugewalker/CGA-U-Net.

62.1CLMay 28
Counterfactual Graph for Multi-Agent LLM Calibration

Jiatan Huang, Mingchen Li, Ziming Li et al.

Multi-agent LLM systems often treat agreement as evidence: when many agents in a panel give the same answer, that answer is assumed to be more reliable. We show that this assumption can fail after agents communicate. Communication can induce correlated failures and false consensus, so the same vote share may reflect reliable agreement in one topology but over-confidence in another. We propose CAGE-CAL, a counterfactual agent-graph calibration framework for multi-agent LLMs. For each query, CAGE-CAL compares an observed post-communication agent graph with a matched counterfactual no-communication graph, capturing both pairwise failure correlations and group-level dependencies. Rather than simply counting how many agents agree, CAGE-CAL estimates the counterfactual shift between observed and no-communication dependence, and calibrates confidence accordingly. Across five benchmarks, CAGE-CAL improves reliability discrimination with competitive ECE, and its calibrated confidence further improves topology selection over the best fixed-topology strategy.

CLAug 7, 2023
PaniniQA: Enhancing Patient Education Through Interactive Question Answering

Pengshan Cai, Zonghai Yao, Fei Liu et al.

Patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate our PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions

CLMar 26, 2023
Boosting Few-Shot Text Classification via Distribution Estimation

Han Liu, Feng Zhang, Xiaotong Zhang et al. · apple-ml

Distribution estimation has been demonstrated as one of the most effective approaches in dealing with few-shot image classification, as the low-level patterns and underlying representations can be easily transferred across different tasks in computer vision domain. However, directly applying this approach to few-shot text classification is challenging, since leveraging the statistics of known classes with sufficient samples to calibrate the distributions of novel classes may cause negative effects due to serious category difference in text domain. To alleviate this issue, we propose two simple yet effective strategies to estimate the distributions of the novel classes by utilizing unlabeled query samples, thus avoiding the potential negative transfer issue. Specifically, we first assume a class or sample follows the Gaussian distribution, and use the original support set and the nearest few query samples to estimate the corresponding mean and covariance. Then, we augment the labeled samples by sampling from the estimated distribution, which can provide sufficient supervision for training the classification model. Extensive experiments on eight few-shot text classification datasets show that the proposed method outperforms state-of-the-art baselines significantly.

40.4MEJun 1
Scalable Counterfactual Risk Estimation for Rare Events in Longitudinal Data

Xiaohui Yin, Avijit Mitra, Ying Zhou et al.

Estimating the causal effect of time-varying treatments on survival outcomes in large observational studies is computationally demanding, particularly when outcomes are rare. While g-formula-based methods such as the iterative conditional expectation (ICE) estimator provide a principled framework for longitudinal causal inference, they become computationally expensive, especially when bootstrap-based variance estimation is required. In addition, outcome rarity at each time point induces severe class imbalance, leading to instability and convergence issues in logistic regression and related models. To address these challenges, we propose a principled subsampling and reweighting strategy for longitudinal survival data that can be applied to a range of existing causal effect estimators in this setting, including the ICE estimator. The proposed method substantially reduces computational burden while preserving consistency and improving estimation stability in rare-outcome settings. We evaluate the method through simulations and validate it using a large-scale EHR cohort study on social and behavioral determinants of health (SBDH) and suicide risk, demonstrating its effectiveness for modeling rare outcomes in longitudinal data.

88.8AIApr 16
CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

Yubin Kim, Salman Rahman, Samuel Schmidgall et al.

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ= 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ= 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ= -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ= -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated ΔR^2 increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.

CLNov 24, 2022
Multi-label Few-shot ICD Coding as Autoregressive Generation with Prompt

Zhichao Yang, Sunjae Kwon, Zonghai Yao et al.

Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge - Many ICD codes are infrequently assigned yet infrequent ICD codes are important clinically. This study addresses the long-tail challenge by transforming this multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective to generate free text diagnoses and procedure using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting the high dimensional space of ICD codes, our model generates the lower dimension of text descriptions, which then infer ICD codes. Third, we designed a novel prompt template for multi-label classification. We evaluate our Generation with Prompt model with the benchmark of all code assignment (MIMIC-III-full) and few shot ICD code assignment evaluation benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model performs with a marco F1 30.2, which substantially outperforms the previous MIMIC-III-full SOTA model (marco F1 4.3) and the model specifically designed for few/zero shot setting (marco F1 18.7). Finally, we design a novel ensemble learner, a cross attention reranker with prompts, to integrate previous SOTA and our best few-shot coding predictions. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.

CLOct 12, 2022
MedJEx: A Medical Jargon Extraction Model with Wiki's Hyperlink Span and Contextualized Masked Language Model Score

Sunjae Kwon, Zonghai Yao, Harmon S. Jordan et al.

This paper proposes a new natural language processing (NLP) application for identifying medical jargon terms potentially difficult for patients to comprehend from electronic health record (EHR) notes. We first present a novel and publicly available dataset with expert-annotated medical jargon terms from 18K+ EHR note sentences ($MedJ$). Then, we introduce a novel medical jargon extraction ($MedJEx$) model which has been shown to outperform existing state-of-the-art NLP models. First, MedJEx improved the overall performance when it was trained on an auxiliary Wikipedia hyperlink span dataset, where hyperlink spans provide additional Wikipedia articles to explain the spans (or terms), and then fine-tuned on the annotated MedJ data. Secondly, we found that a contextualized masked language model score was beneficial for detecting domain-specific unfamiliar jargon terms. Moreover, our results show that training on the auxiliary Wikipedia hyperlink span datasets improved six out of eight biomedical named entity recognition benchmark datasets. Both MedJ and MedJEx are publicly available.

LGMay 12, 2022
ScAN: Suicide Attempt and Ideation Events Dataset

Bhanu Pratap Singh Rawat, Samuel Kovaly, Wilfred R. Pigeon et al.

Suicide is an important public health concern and one of the leading causes of death worldwide. Suicidal behaviors, including suicide attempts (SA) and suicide ideations (SI), are leading risk factors for death by suicide. Information related to patients' previous and current SA and SI are frequently documented in the electronic health record (EHR) notes. Accurate detection of such documentation may help improve surveillance and predictions of patients' suicidal behaviors and alert medical professionals for suicide prevention efforts. In this study, we first built Suicide Attempt and Ideation Events (ScAN) dataset, a subset of the publicly available MIMIC III dataset spanning over 12k+ EHR notes with 19k+ annotated SA and SI events information. The annotations also contain attributes such as method of suicide attempt. We also provide a strong baseline model ScANER (Suicide Attempt and Ideation Events Retriever), a multi-task RoBERTa-based model with a retrieval module to extract all the relevant suicidal behavioral evidences from EHR notes of an hospital-stay and, and a prediction module to identify the type of suicidal behavior (SA and SI) concluded during the patient's stay at the hospital. ScANER achieved a macro-weighted F1-score of 0.83 for identifying suicidal behavioral evidences and a macro F1-score of 0.78 and 0.60 for classification of SA and SI for the patient's hospital-stay, respectively. ScAN and ScANER are publicly available.

CLDec 6, 2022
Automated Identification of Eviction Status from Electronic Health Record Notes

Zonghai Yao, Jack Tsai, Weisong Liu et al.

Objective: Evictions are important social and behavioral determinants of health. Evictions are associated with a cascade of negative events that can lead to unemployment, housing insecurity/homelessness, long-term poverty, and mental health problems. In this study, we developed a natural language processing system to automatically detect eviction status from electronic health record (EHR) notes. Materials and Methods: We first defined eviction status (eviction presence and eviction period) and then annotated eviction status in 5000 EHR notes from the Veterans Health Administration (VHA). We developed a novel model, KIRESH, that has shown to substantially outperform other state-of-the-art models such as fine-tuning pre-trained language models like BioBERT and BioClinicalBERT. Moreover, we designed a novel prompt to further improve the model performance by using the intrinsic connection between the two sub-tasks of eviction presence and period prediction. Finally, we used the Temperature Scaling-based Calibration on our KIRESH-Prompt method to avoid over-confidence issues arising from the imbalance dataset. Results: KIRESH-Prompt substantially outperformed strong baseline models including fine-tuning the BioClinicalBERT model to achieve 0.74672 MCC, 0.71153 Macro-F1, and 0.83396 Micro-F1 in predicting eviction period and 0.66827 MCC, 0.62734 Macro-F1, and 0.7863 Micro-F1 in predicting eviction presence. We also conducted additional experiments on a benchmark social determinants of health (SBDH) dataset to demonstrate the generalizability of our methods. Conclusion and Future Work: KIRESH-Prompt has substantially improved eviction status classification. We plan to deploy KIRESH-Prompt to the VHA EHRs as an eviction surveillance system to help address the US Veterans' housing insecurity.

IRAug 26, 2022
Extracting Biomedical Factual Knowledge Using Pretrained Language Model and Electronic Health Record Context

Zonghai Yao, Yi Cao, Zhichao Yang et al.

Language Models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted some experiments to use prompt methods to extract knowledge from LMs as new knowledge Bases (LMs as KBs). However, prompting can only be used as a low bound for knowledge extraction, and perform particularly poorly on biomedical domain KBs. In order to make LMs as KBs more in line with the actual application scenarios of the biomedical domain, we specifically add EHR notes as context to the prompt to improve the low bound in the biomedical domain. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by those language models can distinguish the correct knowledge from the noise knowledge in the EHR notes, and such distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.

99.6CVApr 14Code
Medical thinking with multiple images

Zonghai Yao, Benlu Wang, Yifan Zhang et al.

Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by Qwen3.5-397B-A17B at 52.2% and Qwen3.5-27B at 50.6%. Further analysis identifies grounded multi-image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher-level inference can help. Providing expert single-image cues and cross-image summaries improves performance, whereas replacing them with self-generated intermediates reduces accuracy. Step-level analysis shows that over 70% of errors arise from image reading and cross-view integration. Scaling results further show that additional inference-time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world multimodal clinical inputs.

CLJul 5, 2023
ODD: A Benchmark Dataset for the Natural Language Processing based Opioid Related Aberrant Behavior Detection

Sunjae Kwon, Xun Wang, Weisong Liu et al.

Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients' EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available.

CLNov 18, 2022
Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing

Zonghai Yao, Yi Cao, Zhichao Yang et al.

Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs' knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduce context variance into the prompt generation and propose a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we propose the concept of "Misunderstand" in LAMA for the first time. Through experiments on 12 PLMs, our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric makes BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle "understand" from just "read and copy".

CLJun 29, 2023
UMASS_BioNLP at MEDIQA-Chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations?

Junda Wang, Zonghai Yao, Avijit Mitra et al.

This paper presents UMASS_BioNLP team participation in the MEDIQA-Chat 2023 shared task for Task-A and Task-C. We focus especially on Task-C and propose a novel LLMs cooperation system named a doctor-patient loop to generate high-quality conversation data sets. The experiment results demonstrate that our approaches yield reasonable performance as evaluated by automatic metrics such as ROUGE, medical concept recall, BLEU, and Self-BLEU. Furthermore, we conducted a comparative analysis between our proposed method and ChatGPT and GPT-4. This analysis also investigates the potential of utilizing cooperation LLMs to generate high-quality datasets.

CLJul 26, 2022
Advanced Conditional Variational Autoencoders (A-CVAE): Towards interpreting open-domain conversation generation via disentangling latent feature representation

Ye Wang, Jingbo Liao, Hong Yu et al.

Currently end-to-end deep learning based open-domain dialogue systems remain black box models, making it easy to generate irrelevant contents with data-driven models. Specifically, latent variables are highly entangled with different semantics in the latent space due to the lack of priori knowledge to guide the training. To address this problem, this paper proposes to harness the generative model with a priori knowledge through a cognitive approach involving mesoscopic scale feature disentanglement. Particularly, the model integrates the macro-level guided-category knowledge and micro-level open-domain dialogue data for the training, leveraging the priori knowledge into the latent space, which enables the model to disentangle the latent variables within the mesoscopic scale. Besides, we propose a new metric for open-domain dialogues, which can objectively evaluate the interpretability of the latent space distribution. Finally, we validate our model on different datasets and experimentally demonstrate that our model is able to generate higher quality and more interpretable dialogues than other models.

LGJul 23, 2023
Early Prediction of Alzheimers Disease Leveraging Symptom Occurrences from Longitudinal Electronic Health Records of US Military Veterans

Rumeng Li, Xun Wang, Dan Berlowitz et al.

Early prediction of Alzheimer's disease (AD) is crucial for timely intervention and treatment. This study aims to use machine learning approaches to analyze longitudinal electronic health records (EHRs) of patients with AD and identify signs and symptoms that can predict AD onset earlier. We used a case-control design with longitudinal EHRs from the U.S. Department of Veterans Affairs Veterans Health Administration (VHA) from 2004 to 2021. Cases were VHA patients with AD diagnosed after 1/1/2016 based on ICD-10-CM codes, matched 1:9 with controls by age, sex and clinical utilization with replacement. We used a panel of AD-related keywords and their occurrences over time in a patient's longitudinal EHRs as predictors for AD prediction with four machine learning models. We performed subgroup analyses by age, sex, and race/ethnicity, and validated the model in a hold-out and "unseen" VHA stations group. Model discrimination, calibration, and other relevant metrics were reported for predictions up to ten years before ICD-based diagnosis. The study population included 16,701 cases and 39,097 matched controls. The average number of AD-related keywords (e.g., "concentration", "speaking") per year increased rapidly for cases as diagnosis approached, from around 10 to over 40, while remaining flat at 10 for controls. The best model achieved high discriminative accuracy (ROCAUC 0.997) for predictions using data from at least ten years before ICD-based diagnoses. The model was well-calibrated (Hosmer-Lemeshow goodness-of-fit p-value = 0.99) and consistent across subgroups of age, sex and race/ethnicity, except for patients younger than 65 (ROCAUC 0.746). Machine learning models using AD-related keywords identified from EHR notes can predict future AD diagnoses, suggesting its potential use for identifying AD risk using EHR notes, offering an affordable way for early screening on large population.

CLJun 14, 2022
Label-enhanced Prototypical Network with Contrastive Learning for Multi-label Few-shot Aspect Category Detection

Han Liu, Feng Zhang, Xiaotong Zhang et al.

Multi-label aspect category detection allows a given review sentence to contain multiple aspect categories, which is shown to be more practical in sentiment analysis and attracting increasing attention. As annotating large amounts of data is time-consuming and labor-intensive, data scarcity occurs frequently in real-world scenarios, which motivates multi-label few-shot aspect category detection. However, research on this problem is still in infancy and few methods are available. In this paper, we propose a novel label-enhanced prototypical network (LPN) for multi-label few-shot aspect category detection. The highlights of LPN can be summarized as follows. First, it leverages label description as auxiliary knowledge to learn more discriminative prototypes, which can retain aspect-relevant information while eliminating the harmful effect caused by irrelevant aspects. Second, it integrates with contrastive learning, which encourages that the sentences with the same aspect label are pulled together in embedding space while simultaneously pushing apart the sentences with different aspect labels. In addition, it introduces an adaptive multi-label inference module to predict the aspect count in the sentence, which is simple yet effective. Extensive experimental results on three datasets demonstrate that our proposed model LPN can consistently achieve state-of-the-art performance.

69.7HCMay 4
Bridging Knowledge Gaps in Clinical AI: An Activity Theory Perspective on Interdisciplinary Data Work for Telehealth

Bingsheng Yao, Yao Du, Yue Fu et al.

Advanced AI technologies are increasingly integrated into clinical domains to advance patient care. The design and development of clinical AI technologies necessitate seamless collaboration between clinical and technical experts. However, such interdisciplinary teams are often unsuccessful, with a lack of systematic analysis of collaboration barriers and coping strategies. This work examines two clinical AI collaborations in the context of speech-language pathology via semi-structured interviews with six clinical and seven technical experts. Using Activity Theory (AT) as our analytical lens, we examine persistent knowledge gaps and collaboration tensions across clinical and technical workflows, and show how clinical data can function as boundary objects while interdisciplinary collaborators may act as knowledge brokers to help address these challenges. Our findings contribute to CSCW research on interdisciplinary teams' data work by showing how shared clinical data, boundary objects, and broker roles shape coordination in early-stage clinical AI collaboration, and by providing insights into best practices for future collaboration.

CLOct 7, 2022
Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding

Zhichao Yang, Shufan Wang, Bhanu Pratap Singh Rawat et al.

Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with average length of 3,000+ tokens. This task is challenging due to a high-dimensional space of multi-label assignment (tens of thousands of ICD codes) and the long-tail challenge: only a few codes (common diseases) are frequently assigned while most codes (rare diseases) are infrequently assigned. This study addresses the long-tail challenge by adapting a prompt-based fine-tuning technique with label semantics, which has been shown to be effective under few-shot setting. To further enhance the performance in medical domain, we propose a knowledge-enhanced longformer by injecting three domain-specific knowledge: hierarchy, synonym, and abbreviation with additional pretraining using contrastive learning. Experiments on MIMIC-III-full, a benchmark dataset of code assignment, show that our proposed method outperforms previous state-of-the-art method in 14.5% in marco F1 (from 10.3 to 11.8, P<0.001). To further test our model on few-shot setting, we created a new rare diseases coding dataset, MIMIC-III-rare50, on which our model improves marco F1 from 17.1 to 30.4 and micro F1 from 17.2 to 32.6 compared to previous method.

CLOct 30, 2023
BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

Hieu Tran, Zhichao Yang, Zonghai Yao et al.

To enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created the BioInstruct, comprising 25,005 instructions to instruction-tune LLMs(LLaMA 1 & 2, 7B & 13B version). The instructions were created by prompting the GPT-4 language model with three-seed samples randomly drawn from an 80 human curated instructions. We employed Low-Rank Adaptation(LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering(QA), information extraction(IE), and text generation(GEN). We also examined whether categories(e.g., QA, IE, and generation) of instructions impact model performance. Comparing with LLMs without instruction-tuned, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in Generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting the synergies between two tasks. The BioInstruct dataset serves as a valuable resource and instruction tuned LLMs lead to the best performing BioNLP applications.

CLJun 5, 2022
A Simple Meta-learning Paradigm for Zero-shot Intent Classification with Mixture Attention Mechanism

Han Liu, Siyang Zhao, Xiaotong Zhang et al.

Zero-shot intent classification is a vital and challenging task in dialogue systems, which aims to deal with numerous fast-emerging unacquainted intents without annotated training data. To obtain more satisfactory performance, the crucial points lie in two aspects: extracting better utterance features and strengthening the model generalization ability. In this paper, we propose a simple yet effective meta-learning paradigm for zero-shot intent classification. To learn better semantic representations for utterances, we introduce a new mixture attention mechanism, which encodes the pertinent word occurrence patterns by leveraging the distributional signature attention and multi-layer perceptron attention simultaneously. To strengthen the transfer ability of the model from seen classes to unseen classes, we reformulate zero-shot intent classification with a meta-learning strategy, which trains the model by simulating multiple zero-shot classification tasks on seen categories, and promotes the model generalization ability with a meta-adapting procedure on mimic unseen categories. Extensive experiments on two real-world dialogue datasets in different languages show that our model outperforms other strong baselines on both standard and generalized zero-shot intent classification tasks.

CLOct 24, 2023
NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes

Junda Wang, Zonghai Yao, Zhichao Yang et al.

We introduce NoteChat, a novel cooperative multi-agent framework leveraging Large Language Models (LLMs) to generate patient-physician dialogues. NoteChat embodies the principle that an ensemble of role-specific LLMs, through structured role-play and strategic prompting, can perform their assigned roles more effectively. The synergy among these role-playing LLMs results in a cohesive and efficient dialogue generation. Evaluation on MTS-dialogue, a benchmark dataset for patient-physician dialogues-note pairs, shows that models trained with the augmented synthetic patient-physician dialogues by NoteChat outperforms other state-of-the-art models for generating clinical notes. Our comprehensive automatic and human evaluation demonstrates that NoteChat substantially surpasses state-of-the-art models like ChatGPT and GPT-4 up to 22.78% by domain experts in generating superior synthetic patient-physician dialogues based on clinical notes. NoteChat has the potential to engage patients directly and help clinical documentation, a leading cause of physician burnout.

AIDec 22, 2022
Enhancing the prediction of disease outcomes using electronic health records and pretrained deep learning models

Zhichao Yang, Weisong Liu, Dan Berlowitz et al.

Question: Can an encoder-decoder architecture pretrained on a large dataset of longitudinal electronic health records improves patient outcome predictions? Findings: In this prognostic study of 6.8 million patients, our denoising sequence-to-sequence prediction model of multiple outcomes outperformed state-of-the-art models scuh pretrained BERT on a broad range of patient outcomes, including intentional self-harm and pancreatic cancer. Meaning: Deep bidirectional and autoregressive representation improves patient outcome prediction.

CLAug 7, 2023
Intelligent Assistant Language Understanding On Device

Cecilia Aas, Hisham Abdelsalam, Irina Belousova et al.

It has recently become feasible to run personal digital assistants on phones and other personal devices. In this paper we describe a design for a natural language understanding system that runs on device. In comparison to a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate. We describe what led to key choices about architecture and technologies. For example, some approaches in the dialog systems literature are difficult to maintain over time in a deployment setting. We hope that sharing learnings from our practical experiences may help inform future work in the research community.

52.1LGMar 16Code
Mastering the Minority: An Uncertainty-guided Multi-Expert Framework for Challenging-tailed Sequence Learning

Ye Wang, Zixuan Wu, Lifeng Shen et al.

Imbalanced data distribution remains a critical challenge in sequential learning, leading models to easily recognize frequent categories while failing to detect minority classes adequately. The Mixture-of-Experts model offers a scalable solution, yet its application is often hindered by parameter inefficiency, poor expert specialization, and difficulty in resolving prediction conflicts. To Master the Minority classes effectively, we propose the Uncertainty-based Multi-Expert fusion network (UME) framework. UME is designed with three core innovations: First, we employ Ensemble LoRA for parameter-efficient modeling, significantly reducing the trainable parameter count. Second, we introduce Sequential Specialization guided by Dempster-Shafer Theory (DST), which ensures effective specialization on the challenging-tailed classes. Finally, an Uncertainty-Guided Fusion mechanism uses DST's certainty measures to dynamically weigh expert opinions, resolving conflicts by prioritizing the most confident expert for reliable final predictions. Extensive experiments across four public hierarchical text classification datasets demonstrate that UME achieves state-of-the-art performance. We achieve a performance gain of up to 17.97\% over the best baseline on individual categories, while reducing trainable parameters by up to 10.32\%. The findings highlight that uncertainty-guided expert coordination is a principled strategy for addressing challenging-tailed sequence learning. Our code is available at https://github.com/CQUPTWZX/Multi-experts.

47.9CLMay 25
In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

Mingchen Li, Jiatan Huang, Chuxu Zhang et al.

In-context learning has recently been linked to implicit gradient descent in linear self-attention models, suggesting that context can induce a forward-pass update. Retrieval-augmented generation (RAG) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation. We study RAG as an in-context optimization process. First, we show that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG objective covering both projection-based and dot-product retrieval interfaces. This gives an exact regime where retrieval-augmented prediction and in-context optimization coincide. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures. Finally, we turn this view into a lightweight method for frozen RAG LLMs. The method keeps the retriever and backbone fixed, and predicts a context-conditioned update to a generator-side evidence-use interface. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward-only update improves a shared-interface baseline, transfers to held-out tasks, and approaches test-time gradient adaptation at much lower per-query cost.

56.0CLMay 25
RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

Mingchen Li, Hansi Zeng, Zhuo Qian et al.

Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.

CLOct 30, 2023
Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization

Prakamya Mishra, Zonghai Yao, Shuwei Chen et al.

Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in the clinical domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown the promise of aligning LLMs to be factually consistent during generation, but such training procedure requires high-quality human-annotated data, which can be extremely expensive to get in the clinical domain. In this work, we propose a new pipeline using ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached the expert level in many clinical NLP tasks (e.g., USMLE QA), there is not much previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective.

CLOct 30, 2023
EHRTutor: Enhancing Patient Understanding of Discharge Instructions

Zihao Zhang, Zonghai Yao, Huixue Zhou et al.

Large language models have shown success as a tutor in education in various fields. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging the Large Language Model (LLM) for patient education through conversational question-answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.

12.9CLMay 24
SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

Han Liu, Zhi Xu, Xiaotong Zhang et al.

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.

CLAug 10, 2024
Large Language Model-based Role-Playing for Personalized Medical Jargon Extraction

Jung Hoon Lim, Sunjae Kwon, Zonghai Yao et al.

Previous studies reveal that Electronic Health Records (EHR), which have been widely adopted in the U.S. to allow patients to access their personal medical information, do not have high readability to patients due to the prevalence of medical jargon. Tailoring medical notes to individual comprehension by identifying jargon that is difficult for each person will enhance the utility of generative models. We present the first quantitative analysis to measure the impact of role-playing in LLM in medical term extraction. By comparing the results of Mechanical Turk workers over 20 sentences, our study demonstrates that LLM role-playing improves F1 scores in 95% of cases across 14 different socio-demographic backgrounds. Furthermore, applying role-playing with in-context learning outperformed the previous state-of-the-art models. Our research showed that ChatGPT can improve traditional medical term extraction systems by utilizing role-play to deliver personalized patient education, a potential that previous models had not achieved.

CLNov 16, 2023
Do Physicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation

Zonghai Yao, Ahmed Jaafar, Beining Wang et al.

This study examines the effect of prompt engineering on the performance of Large Language Models (LLMs) in clinical note generation. We introduce an Automatic Prompt Optimization (APO) framework to refine initial prompts and compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT3.5 and GPT4. Results highlight GPT4 APO's superior performance in standardizing prompt quality across clinical note sections. A human-in-the-loop approach shows that experts maintain content quality post-APO, with a preference for their own modifications, suggesting the value of expert customization. We recommend a two-phase optimization process, leveraging APO-GPT4 for consistency and expert input for personalization.

CLMar 3
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang et al.

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100\% in certain scenarios.

CLFeb 27, 2024Code
JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability

Junda Wang, Zhichao Yang, Zonghai Yao et al.

Large Language Models (LLMs) have demonstrated a remarkable potential in medical knowledge acquisition and question-answering. However, LLMs can potentially hallucinate and yield factually incorrect outcomes, even with domain-specific pretraining. Previously, retrieval augmented generation (RAG) has limited success in addressing hallucinations. Unlike previous methods in RAG where the retrieval model was trained separately from the LLM, we introduce JMLR (for Jointly trains LLM and information Retrieval) during the fine-tuning phase. The synchronized training mechanism enhances JMLR's ability to retrieve clinical guidelines and leverage medical knowledge to reason and answer questions and reduces the demand for computational resources. We evaluated JMLR on the important medical question-answering application. Our experimental results demonstrate that JMLR-13B (70.5%) outperforms a previous state-of-the-art open-source model using conventional pre-training and fine-tuning Meditron-70B (68.9%) and Llama2-13B with RAG (67.7%) on a medical question-answering dataset. Comprehensive evaluations reveal JMLR-13B enhances reasoning quality and reduces hallucinations better than Claude3-Opus. Additionally, JMLR-13B (148 GPU hours) also trains much faster than Meditron-70B (42630 GPU hours). Through this work, we provide a new and efficient knowledge enhancement method for healthcare, demonstrating the potential of integrating retrieval and LLM training for medical question-answering systems.

CLNov 12, 2023
Large Language Models are In-context Teachers for Knowledge Reasoning

Jiachen Zhao, Zonghai Yao, Zhichao Yang et al.

In this work, we study in-context teaching (ICT), where a teacher provides in-context example rationales to teach a student to reason over unseen cases. Human teachers are usually required to craft in-context demonstrations, which are costly and have high variance. We ask whether a large language model (LLM) can serve as a more effective in-context teacher for itself or other LLMs, compared to humans. Inspired by the Encoding Specificity Hypothesis from human episodic memory, we hypothesize that in-context exemplars crafted by the teacher should match the training data of the student. This hypothesis motivates us to propose Self-Explain where an LLM's self-elicited explanations are used as in-context demonstrations for prompting it as they are generalized from the model's training examples. Self-Explain is shown to significantly outperform using human-crafted exemplars and other baselines. Furthermore, we reveal that for ICT, rationales from different teacher LLMs or human experts that more resemble the student LLM's self-explanations are better in-context demonstrations. This supports our encoding specificity hypothesis. We then propose Teach-Back that aligns a teacher LLM with the student to enhance the ICT performance. For example, Teach-Back enables a 7B model to teach the much larger GPT-3.5 in context, surpassing human teachers by around 5% in test accuracy on medical question answering.

50.4CVApr 14
HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models

Han Liu, Jiaqi Li, Zhi Xu et al.

Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.

56.9CLApr 11
SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

Han Liu, Haotian Gao, Xiaotong Zhang et al.

Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited devices while preserving generative quality, encompasses two primary methods: quantization aware training (QAT) and post-training quantization (PTQ). QAT involves additional retraining or fine-tuning, thus inevitably resulting in high training cost and making it unsuitable for LLMs. Consequently, PTQ has become the research hotspot in recent quantization methods. However, existing PTQ methods usually rely on various complex computation procedures and suffer from considerable performance degradation under low-bit quantization settings. To alleviate the above issues, we propose a simple and effective post-training quantization paradigm for LLMs, named SEPTQ. Specifically, SEPTQ first calculates the importance score for each element in the weight matrix and determines the quantization locations in a static global manner. Then it utilizes the mask matrix which represents the important locations to quantize and update the associated weights column-by-column until the appropriate quantized weight matrix is obtained. Compared with previous methods, SEPTQ simplifies the post-training quantization procedure into only two steps, and considers the effectiveness and efficiency simultaneously. Experimental results on various datasets across a suite of models ranging from millions to billions in different quantization bit-levels demonstrate that SEPTQ significantly outperforms other strong baselines, especially in low-bit quantization scenarios.

CLNov 26, 2022
An Automatic SOAP Classification System Using Weakly Supervision And Transfer Learning

Sunjae Kwon, Zhichao Yang, Hong Yu

In this paper, we introduce a comprehensive framework for developing a machine learning-based SOAP (Subjective, Objective, Assessment, and Plan) classification system without manually SOAP annotated training data or with less manually SOAP annotated training data. The system is composed of the following two parts: 1) Data construction, 2) A neural network-based SOAP classifier, and 3) Transfer learning framework. In data construction, since a manual construction of a large size training dataset is expensive, we propose a rule-based weak labeling method utilizing the structured information of an EHR note. Then, we present a SOAP classifier composed of a pre-trained language model and bi-directional long-short term memory with conditional random field (Bi-LSTM-CRF). Finally, we propose a transfer learning framework that re-uses the trained parameters of the SOAP classifier trained with the weakly labeled dataset for datasets collected from another hospital. The proposed weakly label-based learning model successfully performed SOAP classification (89.99 F1-score) on the notes collected from the target hospital. Otherwise, in the notes collected from other hospitals and departments, the performance dramatically decreased. Meanwhile, we verified that the transfer learning framework is advantageous for inter-hospital adaptation of the model increasing the models' performance in every cases. In particular, the transfer learning approach was more efficient when the manually annotated data size was smaller. We showed that SOAP classification models trained with our weakly labeling algorithm can perform SOAP classification without manually annotated data on the EHR notes from the same hospital. The transfer learning framework helps SOAP classification model's inter-hospital migration with a minimal size of the manually annotated dataset.

98.0AIMay 21
Towards a General Intelligence and Interface for Wearable Health Data

Girish Narayanswamy, Maxwell A. Xu, A. Ali Heydari et al.

While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low-level sensor data into representations capable of characterizing higher-level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically unfeasible, contributing to a scarcity of data with high-quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population scale representation unlocks label-efficient few-shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.

CLJun 2, 2023
5IDER: Unified Query Rewriting for Steering, Intent Carryover, Disfluencies, Entity Carryover and Repair

Jiarui Lu, Bo-Hsiang Tseng, Joel Ruben Antony Moniz et al.

Providing voice assistants the ability to navigate multi-turn conversations is a challenging problem. Handling multi-turn interactions requires the system to understand various conversational use-cases, such as steering, intent carryover, disfluencies, entity carryover, and repair. The complexity of this problem is compounded by the fact that these use-cases mix with each other, often appearing simultaneously in natural language. This work proposes a non-autoregressive query rewriting architecture that can handle not only the five aforementioned tasks, but also complex compositions of these use-cases. We show that our proposed model has competitive single task performance compared to the baseline approach, and even outperforms a fine-tuned T5 model in use-case compositions, despite being 15 times smaller in parameters and 25 times faster in latency.

CLAug 9, 2024
A Psychology-based Unified Dynamic Framework for Curriculum Learning

Guangyu Meng, Qingkai Zeng, John P. Lalor et al.

Directly learning from examples of varying difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. Drawing inspiration from psychometrics, this paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF). We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a training strategy, Dynamic Data Selection via Model Ability Estimation (DDS-MAE), to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to aligned training data selection and faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained large language models with PUDF leads to higher accuracy and faster convergence on a suite of benchmark datasets compared to standard fine-tuning and state-of-the-art CL methods. Ablation studies and downstream analyses further validate the impact of PUDF for CL.

CLFeb 3, 2024Code
SynthDST: Synthetic Data is All You Need for Few-Shot Dialog State Tracking

Atharva Kulkarni, Bo-Hsiang Tseng, Joel Ruben Antony Moniz et al. · cmu

In-context learning with Large Language Models (LLMs) has emerged as a promising avenue of research in Dialog State Tracking (DST). However, the best-performing in-context learning methods involve retrieving and adding similar examples to the prompt, requiring access to labeled training data. Procuring such training data for a wide range of domains and applications is time-consuming, expensive, and, at times, infeasible. While zero-shot learning requires no training data, it significantly lags behind the few-shot setup. Thus, `\textit{Can we efficiently generate synthetic data for any dialogue schema to enable few-shot prompting?}' Addressing this question, we propose \method, a data generation framework tailored for DST, utilizing LLMs. Our approach only requires the dialogue schema and a few hand-crafted dialogue templates to synthesize natural, coherent, and free-flowing dialogues with DST annotations. Few-shot learning using data from {\method} results in $4-5%$ improvement in Joint Goal Accuracy over the zero-shot baseline on MultiWOZ 2.1 and 2.4. Remarkably, our few-shot learning approach recovers nearly $98%$ of the performance compared to the few-shot setup using human-annotated training data. Our synthetic data and code can be accessed at https://github.com/apple/ml-synthdst

HCJun 10, 2023
Referring to Screen Texts with Voice Assistants

Shruti Bhargava, Anand Dhoot, Ing-Marie Jonsson et al.

Voice assistants help users make phone calls, send messages, create events, navigate, and do a lot more. However, assistants have limited capacity to understand their users' context. In this work, we aim to take a step in this direction. Our work dives into a new experience for users to refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens. Our focus lies in reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, similar to visual grounding. We collect a dataset and propose a lightweight general-purpose model for this novel experience. Due to the high cost of consuming pixels directly, our system is designed to rely on the extracted text from the UI. Our model is modular, thus offering flexibility, improved interpretability, and efficient runtime memory utilization.