Daqing He

CL
h-index: 20
23 papers
4,288 citations
Novelty: 43%
AI Score: 57

23 Papers

CL · Aug 20, 2022 · Code
General-to-Specific Transfer Labeling for Domain Adaptable Keyphrase Generation

Rui Meng, Tong Wang, Xingdi Yuan et al. · microsoft-research

Training keyphrase generation (KPG) models requires a large amount of annotated data, which can be prohibitively expensive and is often limited to specific domains. In this study, we first demonstrate that large distribution shifts among different domains severely hinder the transferability of KPG models. We then propose a three-stage pipeline, which gradually guides KPG models' learning focus from general syntactical features to domain-related semantics, in a data-efficient manner. With Domain-general Phrase pre-training, we pre-train Sequence-to-Sequence models with generic phrase annotations that are widely available on the web, which enables the models to generate phrases in a wide range of domains. The resulting model is then applied in the Transfer Labeling stage to produce domain-specific pseudo keyphrases, which help adapt models to a new domain. Finally, we fine-tune the model with limited data with true labels to fully adapt it to the target domain. Our experimental results show that the proposed process can produce good-quality keyphrases in new domains and achieve consistent improvements after adaptation with limited in-domain annotated data. All code and datasets are available at https://github.com/memray/OpenNMT-kpg-release.

CL · May 23
StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detection in Multi-Hop Question Answering

Yuelyu Ji, Zhuochun Li, Hui Ji et al.

We present StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: Contradicted Claim (CC), Irrelevant Evidence (IE), or Missing Bridge (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, κ = 0.704), StepGap reaches sF1 = 72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage hurts F1 when removed, while three of four LLM-only removals improve F1, a sign of competing-error cancellation, where internal stages mask each other's errors. We further expose a Q-F1 trap: question-level F1 is mechanically inflated by checkers that flag every step, making step-level F1 the necessary diagnostic. Used as a typed GRPO process reward, StepGap improves Qwen2.5-7B-Instruct Exact Match from 32.1 ± 0.3 to 35.4 ± 0.9 across three seeds, with the single-run comparison showing a +5.6 Avg EM gain over the matched Search-R1 GRPO reproduction.

CL · Jan 30
Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Zhuochun Li, Yong Zhang, Ming Li et al.

Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.

CL · Jan 2
Retrieval–Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends

Yuelyu Ji, Zhuochun Li, Rui Meng et al.

Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval–reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval–reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.

AI · Sep 28, 2025 · Code
Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

Zhaoqi Wang, Daqing He, Zijian Zhang et al.

Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (Prompt Jailbreaking via Semantic and Structural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then structured into a GraphRAG system that, by leveraging extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.

CL · Apr 23, 2017 · Code
Deep Keyphrase Generation

Rui Meng, Sanqiang Zhao, Shuguang Han et al.

Keyphrases provide highly condensed information that can be effectively used for understanding, organizing and retrieving text content. Though previous studies have provided many workable solutions for automated keyphrase extraction, they commonly divided the to-be-summarized content into multiple text chunks, then ranked and selected the most meaningful ones. These approaches could neither identify keyphrases that do not appear in the text, nor capture the real semantic meaning behind the text. We propose a generative model for keyphrase prediction with an encoder-decoder framework, which can effectively overcome the above drawbacks. We name it deep keyphrase generation since it attempts to capture the deep semantic meaning of the content with a deep learning method. Empirical analysis on six datasets demonstrates that our proposed model not only achieves a significant performance boost on extracting keyphrases that appear in the source text, but also can generate absent keyphrases based on the semantic meaning of the text. Code and dataset are available at https://github.com/memray/OpenNMT-kpg-release.

CR · Jan 9
Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning

Zhaoqi Wang, Zijian Zhang, Daqing He et al.

Large language models (LLMs) have demonstrated remarkable capabilities across diverse applications; however, they remain critically vulnerable to jailbreak attacks that elicit harmful responses violating human values and safety guidelines. Despite extensive research on defense mechanisms, existing safeguards prove insufficient against sophisticated adversarial strategies. In this work, we propose iMIST (interactive Multi-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious queries as normal tool invocations to bypass content filters, while simultaneously introducing an interactive progressive optimization algorithm that dynamically escalates response harmfulness through multi-turn dialogues guided by real-time harmfulness assessment. Our experiments on widely used models demonstrate that iMIST achieves higher attack effectiveness while maintaining low rejection rates. These results reveal critical vulnerabilities in current LLM safety mechanisms and underscore the urgent need for more robust defense strategies.

CL · May 21, 2024
RAG-RLRC-LaySum at BioLaySumm: Integrating Retrieval-Augmented Generation and Readability Control for Layman Summarization of Biomedical Texts

Yuelyu Ji, Zhuochun Li, Rui Meng et al.

This paper introduces the RAG-RLRC-LaySum framework, designed to make complex biomedical research understandable to laymen through advanced Natural Language Processing (NLP) techniques. Our Retrieval Augmented Generation (RAG) solution, enhanced by a reranking method, utilizes multiple knowledge sources to ensure the precision and pertinence of lay summaries. Additionally, our Reinforcement Learning for Readability Control (RLRC) strategy improves readability, making scientific content comprehensible to non-specialists. Evaluations using the publicly accessible PLOS and eLife datasets show that our methods surpass the Plain Gemini model, demonstrating a 20% increase in readability scores, a 15% improvement in ROUGE-2 relevance scores, and a 10% enhancement in factual accuracy. The RAG-RLRC-LaySum framework effectively democratizes scientific knowledge, enhancing public engagement with biomedical discoveries.

CL · Apr 5, 2024
Effects of Different Prompts on the Quality of GPT-4 Responses to Dementia Care Questions

Zhuochun Li, Bo Xie, Robin Hilsabeck et al.

Evidence suggests that different prompts lead large language models (LLMs) to generate responses of varying quality. Yet, little is known about prompts' effects on response quality in healthcare domains. In this exploratory study, we address this gap, focusing on a specific healthcare domain: dementia caregiving. We first developed an innovative prompt template with three components: (1) system prompts (SPs) featuring 4 different roles; (2) an initialization prompt; and (3) task prompts (TPs) specifying different levels of detail, totaling 12 prompt combinations. Next, we selected 3 social media posts containing complicated, real-world questions about dementia caregivers' challenges in 3 areas: memory loss and confusion, aggression, and driving. We then entered these posts into GPT-4, with our 12 prompts, to generate 12 responses per post, totaling 36 responses. We compared the word count of the 36 responses to explore potential differences in response length. Two experienced dementia care clinicians on our team assessed the response quality using a rating scale with 5 quality indicators: factual, interpretation, application, synthesis, and comprehensiveness (scoring range: 0-5; higher scores indicate higher quality).

CL · May 23, 2025
Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation

Yuelyu Ji, Rui Meng, Zhuochun Li et al.

Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15%. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.
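The idea of a time-varying scheduler that reweights step-level reward signals can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not list the seven factors or the schedule shape, so the sketch uses only the four named categories, hypothetical weight values, and simple linear interpolation between an "early" and a "late" weighting. It is not EVO-RAG's actual scheduler.

```python
# Hypothetical early-episode and late-episode weights (illustrative values only).
EARLY_WEIGHTS = {"relevance": 0.5, "redundancy": 0.1, "efficiency": 0.1, "correctness": 0.3}
LATE_WEIGHTS = {"relevance": 0.2, "redundancy": 0.2, "efficiency": 0.3, "correctness": 0.3}


def scheduled_weights(step: int, max_steps: int) -> dict:
    """Linearly interpolate from EARLY_WEIGHTS to LATE_WEIGHTS over an episode."""
    if max_steps <= 1:
        progress = 0.0
    else:
        # Clamp progress to [0, 1] so out-of-range steps stay well defined.
        progress = min(max(step / (max_steps - 1), 0.0), 1.0)
    return {
        name: (1.0 - progress) * EARLY_WEIGHTS[name] + progress * LATE_WEIGHTS[name]
        for name in EARLY_WEIGHTS
    }


def scalar_reward(step_rewards: dict, step: int, max_steps: int) -> float:
    """Collapse a per-factor reward vector into one scalar using the scheduled weights."""
    weights = scheduled_weights(step, max_steps)
    return sum(weights[name] * step_rewards[name] for name in weights)


if __name__ == "__main__":
    rewards = {"relevance": 1.0, "redundancy": 0.0, "efficiency": 0.5, "correctness": 0.0}
    for step in range(5):
        print(step, round(scalar_reward(rewards, step, max_steps=5), 3))
```

With these assumed weights, the same per-factor rewards yield a different scalar at different points in the episode, which is the behavior the abstract attributes to dynamic reward scheduling.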

CL · Mar 29, 2025
Memory-Aware and Uncertainty-Guided Retrieval for Multi-Hop Question Answering

Yuelyu Ji, Rui Meng, Zhuochun Li et al.

Multi-hop question answering (QA) requires models to retrieve and reason over multiple pieces of evidence. While Retrieval-Augmented Generation (RAG) has made progress in this area, existing methods often suffer from two key limitations: (1) fixed or overly frequent retrieval steps, and (2) ineffective use of previously retrieved knowledge. We propose MIND (Memory-Informed and INteractive Dynamic RAG), a framework that addresses these challenges through: (i) prompt-based entity extraction to identify reasoning-relevant elements, (ii) dynamic retrieval triggering based on token-level entropy and attention signals, and (iii) memory-aware filtering, which stores high-confidence facts across reasoning steps to enable consistent multi-hop generation.
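As a rough illustration of entropy-based retrieval triggering, the sketch below computes the Shannon entropy of a next-token probability distribution and requests retrieval when that entropy exceeds a threshold. The function names and the threshold value are hypothetical, and the attention signals mentioned in the abstract are deliberately left out; this is a simplified stand-in, not MIND's actual triggering rule.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_retrieve(probs: Sequence[float], threshold: float = 1.0) -> bool:
    """Trigger retrieval when the model is uncertain about the next token.

    The threshold of 1.0 nat is an arbitrary illustrative value.
    """
    return token_entropy(probs) > threshold


if __name__ == "__main__":
    confident = [0.97, 0.01, 0.01, 0.01]  # peaked distribution, low entropy
    uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform distribution, high entropy
    print(should_retrieve(confident), should_retrieve(uncertain))
```

A peaked distribution stays below the threshold and skips retrieval, while a flat one exceeds it, so retrieval happens only at uncertain steps rather than at a fixed frequency.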

CL · Sep 29, 2025
Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

Zhimeng Luo, Lixin Wu, Adam Frisch et al.

As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, there has emerged a critical need for reliable and accurate evaluation methodologies. Traditional accuracy metrics are inadequate because they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive "spiky" ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT's utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
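To make the two-parameter logistic (2PL) model named above more concrete, the sketch below evaluates the standard 2PL item response function, in which the probability of a correct answer depends on a test taker's latent ability and on an item's discrimination and difficulty. The function and parameter names and the example values are assumptions for illustration; they are not taken from MedIRT, and parameter fitting is not shown.

```python
import math


def p_correct(ability: float, discrimination: float, difficulty: float) -> float:
    """Standard 2PL item response function.

    P(correct) = 1 / (1 + exp(-a * (theta - b))), where theta is the latent
    ability, a the item discrimination, and b the item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def log_likelihood(ability: float, items: list, responses: list) -> float:
    """Log-likelihood of binary responses given (discrimination, difficulty) pairs."""
    total = 0.0
    for (a, b), correct in zip(items, responses):
        p = p_correct(ability, a, b)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total


if __name__ == "__main__":
    # Hypothetical items: (discrimination, difficulty).
    items = [(1.0, -1.0), (1.5, 0.0), (2.0, 1.0)]
    responses = [True, True, False]
    for theta in (-1.0, 0.0, 1.0):
        print(theta, round(log_likelihood(theta, items, responses), 3))
```

When ability equals difficulty the probability is exactly 0.5, and a larger discrimination makes the curve steeper around that point, which is why highly discriminating items separate models of similar ability more sharply than raw accuracy does.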

CL · Sep 2, 2025
Weakly Supervised Medical Entity Extraction and Linking for Chief Complaints

Zhimeng Luo, Zhendong Wang, Rui Meng et al.

A chief complaint (CC) is the reason for the medical visit as stated in the patient's own words. It helps medical professionals to quickly understand a patient's situation, and also serves as a short summary for medical text mining. However, chief complaint records are entered in a variety of ways, resulting in a wide variation of medical notations, which makes it difficult to standardize across different medical institutions for record keeping or text mining. In this study, we propose a weakly supervised method to automatically extract and link entities in chief complaints in the absence of human annotation. We first adopt a split-and-match algorithm to produce weak annotations, including entity mention spans and class labels, on 1.2 million real-world de-identified and IRB approved chief complaint records. Then we train a BERT-based model with generated weak labels to locate entity mentions in chief complaint text and link them to a pre-defined ontology. We conducted extensive experiments, and the results showed that our Weakly Supervised Entity Extraction and Linking method produced superior performance over previous methods without any human annotation.

CL · Sep 2, 2025
Extracting OPQRST in Electronic Health Records using Large Language Models with Reasoning

Zhimeng Luo, Abhibha Gupta, Adam Frisch et al.

The extraction of critical patient information from Electronic Health Records (EHRs) poses significant challenges due to the complexity and unstructured nature of the data. Traditional machine learning approaches often fail to capture pertinent details efficiently, making it difficult for clinicians to utilize these tools effectively in patient care. This paper introduces a novel approach to extracting the OPQRST assessment from EHRs by leveraging the capabilities of Large Language Models (LLMs). We propose to reframe the task from sequence labeling to text generation, enabling the models to provide reasoning steps that mimic a physician's cognitive processes. This approach enhances interpretability and adapts to the limited availability of labeled data in healthcare settings. Furthermore, we address the challenge of evaluating the accuracy of machine-generated text in clinical contexts by proposing a modification to traditional Named Entity Recognition (NER) metrics. This includes the integration of semantic similarity measures, such as BERTScore, to assess the alignment between generated text and the clinical intent of the original records. Our contributions demonstrate a significant advancement in the use of AI in healthcare, offering a scalable solution that improves the accuracy and usability of information extraction from EHRs, thereby aiding clinicians in making more informed decisions and enhancing patient care outcomes.

CL · Jul 19, 2025
A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

Jinyu Liu, Xiaoying Song, Diana Zhang et al.

Providing subject access to information resources is an essential function of any library management system. Large language models (LLMs) have been widely used in classification and summarization tasks, but their capability to perform subject analysis is underexplored. Multi-label classification with traditional machine learning (ML) models has been used for subject analysis but struggles with unseen cases. LLMs offer an alternative but often over-generate and hallucinate. Therefore, we propose a hybrid framework that integrates embedding-based ML models with LLMs. This approach uses ML models to (1) predict the optimal number of LCSH labels to guide LLM predictions and (2) post-edit the predicted terms with actual LCSH terms to mitigate hallucinations. We experimented with LLMs and the hybrid framework to predict the subject terms of books using the Library of Congress Subject Headings (LCSH). Experiment results show that providing initial predictions to guide LLM generations and imposing post-edits result in more controlled and vocabulary-aligned outputs.

CL · May 31, 2021
Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents

Rui Meng, Khushboo Thaker, Lei Zhang et al.

Faceted summarization provides briefings of a document from different perspectives. Readers can quickly comprehend the main points of a long document with the help of a structured outline. However, little research has been conducted on this subject, partially due to the lack of large-scale faceted summarization datasets. In this study, we present FacetSum, a faceted summarization benchmark built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, FacetSum provides multiple summaries, each targeted at specific sections of a long document, including the purpose, method, findings, and value. Analyses and empirical results on our dataset reveal the importance of bringing structure into summaries. We believe FacetSum will spur further advances in summarization research and foster the development of NLP systems that can leverage the structured information in both long texts and summaries.

CL · Sep 22, 2020
An Empirical Study on Neural Keyphrase Generation

Rui Meng, Xingdi Yuan, Tong Wang et al.

Recent years have seen a flourishing of neural keyphrase generation (KPG) works, including the release of several large-scale datasets and a host of new models to tackle them. Model performance on KPG tasks has increased significantly with evolving deep learning research. However, the field still lacks a comprehensive comparison among different model designs and a thorough investigation of related factors that may affect a KPG system's generalization performance. In this empirical study, we aim to fill this gap by providing extensive experimental results and analyzing the most crucial factors impacting the generalizability of KPG models. We hope this study can help clarify some of the uncertainties surrounding the KPG task and facilitate future research on this topic.

IR · May 22, 2020
Concept Annotation for Intelligent Textbooks

Mengdi Wang, Hung Chau, Khushboo Thaker et al.

With the increased popularity of electronic textbooks, there is growing interest in developing a new generation of "intelligent textbooks", which have the ability to guide readers according to their learning goals and current knowledge. Intelligent textbooks extend regular textbooks by integrating machine-manipulatable knowledge such as a knowledge map or a prerequisite-outcome relationship between sections; the most popular form of integrated knowledge is a list of unique knowledge concepts associated with each section. With the help of these concepts, multiple intelligent operations, such as content linking, content recommendation or student modeling, can be performed. However, annotating a reliable set of concepts for a textbook section is a challenge. Automatic unsupervised methods for extracting key-phrases as the concepts are known to have insufficient accuracy. Manual annotation by experts is considered a preferred approach and can be used to produce both the target outcome and the labeled data for training supervised models. However, most researchers in the education domain still consider the concept annotation process an ad-hoc activity rather than an engineering task, resulting in low-quality annotated data. In this paper, we present a textbook knowledge engineering method to obtain reliable concept annotations. The outcomes of our work include a validated knowledge engineering procedure, a code-book for technical concept annotation, and a set of concept annotations for the target textbook, which could be used as a gold standard in further research.

CL · Sep 9, 2019
Does Order Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence

Rui Meng, Xingdi Yuan, Tong Wang et al.

Recently, concatenating multiple keyphrases as a target sequence has been proposed as a new learning paradigm for keyphrase generation. Existing studies concatenate target keyphrases in different orders but no study has examined the effects of ordering on models' behavior. In this paper, we propose several orderings for concatenation and inspect the important factors for training a successful keyphrase generation model. By running comprehensive comparisons, we observe one preferable ordering and summarize a number of empirical findings and challenges, which can shed light on future research on this line of work.

CL · Oct 26, 2018
Integrating Transformer and Paraphrase Rules for Sentence Simplification

Sanqiang Zhao, Rui Meng, Daqing He et al.

Sentence simplification aims to reduce the complexity of a sentence while retaining its original meaning. Current models for sentence simplification adopted ideas from machine translation studies and implicitly learned simplification mapping rules from normal-simple sentence pairs. In this paper, we explore a novel model based on a multi-layer and multi-head attention architecture and we propose two innovative approaches to integrate the Simple PPDB (A Paraphrase Database for Simplification), an external paraphrase knowledge base for simplification that covers a wide range of real-world simplification rules. The experiments show that the integration provides two major benefits: (1) the integrated model outperforms multiple state-of-the-art baseline models for sentence simplification in the literature; and (2) through analysis of the rule utilization, the model seeks to select more accurate simplification rules. The code and models used in the paper are available at https://github.com/Sanqiang/text_simplification.

CL · Oct 11, 2018
One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases

Xingdi Yuan, Tong Wang, Rui Meng et al.

Different texts should by nature correspond to different numbers of keyphrases. This desideratum is largely missing from existing neural keyphrase generation models. In this study, we address this problem from both modeling and evaluation perspectives. We first propose a recurrent generative model that generates multiple keyphrases as delimiter-separated sequences. Generation diversity is further enhanced with two novel techniques that manipulate decoder hidden states. In contrast to previous approaches, our model is capable of generating diverse keyphrases and controlling the number of outputs. We further propose two evaluation metrics tailored towards variable-number generation. We also introduce a new dataset, StackEx, that expands beyond the only existing genre (i.e., academic writing) in keyphrase generation tasks. With both previous and new evaluation metrics, our model outperforms strong baselines on all datasets.
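The delimiter-separated target format can be sketched in a few lines. The snippet below only shows how a variable-length list of keyphrases might be serialized into one target sequence and parsed back from a generated string; the `<sep>` and `<eos>` token strings and the function names are hypothetical choices for illustration, not the tokens or code used in the paper.

```python
from typing import List

SEP = "<sep>"  # hypothetical delimiter token between keyphrases
EOS = "<eos>"  # hypothetical end-of-sequence token


def to_target_sequence(keyphrases: List[str]) -> str:
    """Serialize a variable-length list of keyphrases into one target string."""
    return f" {SEP} ".join(keyphrases) + f" {EOS}"


def from_target_sequence(sequence: str) -> List[str]:
    """Parse a generated string back into keyphrases, dropping blanks and repeats."""
    body = sequence.split(EOS, 1)[0]  # ignore anything after the end token
    phrases = []
    for raw in body.split(SEP):
        phrase = raw.strip()
        if phrase and phrase not in phrases:
            phrases.append(phrase)
    return phrases


if __name__ == "__main__":
    target = to_target_sequence(["keyphrase generation", "sequence-to-sequence"])
    print(target)
    print(from_target_sequence(target))
```

Because the sequence ends only when the end token appears, the number of keyphrases is decided by the decoder itself rather than by a fixed top-k cutoff, which is the property the abstract highlights.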

IR · Sep 19, 2014
Benchmarking the Privacy-Preserving People Search

Shuguang Han, Daqing He, Zhen Yue

People search is an important topic in information retrieval. Many previous studies on this topic employed social networks to boost search performance by incorporating either local network features (e.g. the common connections between the querying user and candidates in social networks), or global network features (e.g. the PageRank), or both. However, the available social network information can be restricted because of the privacy settings of involved users, which in turn would affect the performance of people search. Therefore, in this paper, we focus on the privacy issues in people search. We propose simulating different privacy settings with a public social network due to the unavailability of privacy-concerned networks. Our study examines the influences of privacy concerns on the local and global network features, and their impacts on the performance of people search. Our results show that: 1) the privacy concerns of different people in the networks have different influences, and people with higher association (i.e. higher degree in a network) have a much greater impact on the performance of people search; 2) local network features are more sensitive to the privacy concerns, especially when such concerns come from high-association people in the network who are also related to the querying user. As the first study on this topic, we hope to generate further discussions on these issues.

IR · Apr 6, 2013
Automatic Detection of Search Tactic in Individual Information Seeking: A Hidden Markov Model Approach

Shuguang Han, Zhen Yue, Daqing He

The information seeking process is an important topic in information seeking behavior research. Both qualitative and empirical methods have been adopted in analyzing information seeking processes, with a major focus on uncovering the latent search tactics behind user behaviors. Most existing works require defining search tactics in advance and coding data manually. The few works that can recognize search tactics automatically do not make sense of those tactics. In this paper, we propose using an automatic technique, the Hidden Markov Model (HMM), to explicitly model the search tactics. HMM results show that the identified search tactics of individual information seeking behaviors are consistent with Marchionini's information seeking process model. With the advantages of showing the connections between search tactics and search actions and the transitions among search tactics, we argue that HMM is a useful tool to investigate the information seeking process, or at least that it provides a feasible way to analyze large-scale datasets.