CLApr 5, 2023
Evaluation of ChatGPT Family of Models for Biomedical Reasoning and ClassificationShan Chen, Yingya Li, Sheng Lu et al. · harvard
Recent advances in large language models (LLMs) have shown impressive ability in biomedical question-answering, but have not been adequately investigated for more specific biomedical applications. This study investigates the performance of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) in biomedical tasks beyond question-answering. Because no patient data can be passed to the OpenAI API public interface, we evaluated model performance with over 10000 samples as proxies for two fundamental tasks in the clinical domain - classification and reasoning. The first task is classifying whether statements of clinical and policy recommendations in scientific literature constitute health advice. The second task is causal relation detection from the biomedical literature. We compared LLMs with simpler models, such as bag-of-words (BoW) with logistic regression, and fine-tuned BioBERT models. Despite the excitement around viral ChatGPT, we found that fine-tuning for two fundamental NLP tasks remained the best strategy. The simple BoW model performed on par with the most complex LLM prompting. Prompt engineering required significant investment.
CLSep 4, 2023
Are Emergent Abilities in Large Language Models just In-Context Learning?Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva et al.
Large language models, comprising billions of parameters and pre-trained on extensive web-scale corpora, have been claimed to acquire certain capabilities without having been specifically trained on them. These capabilities, referred to as "emergent abilities," have been a driving force in discussions regarding the potentials and risks of language models. A key challenge in evaluating emergent abilities is that they are confounded by model competencies that arise through alternative prompting techniques, including in-context learning, which is the ability of models to complete a task based on a few examples. We present a novel theory that explains emergent abilities, taking into account their potential confounding factors, and rigorously substantiate this theory through over 1000 experiments. Our findings suggest that purported emergent abilities are not truly emergent, but result from a combination of in-context learning, model memory, and linguistic knowledge. Our work is a foundational step in explaining language model performance, providing a template for their efficient use and clarifying the paradox of their ability to excel in some instances while faltering in others. Thus, we demonstrate that their capabilities should not be overestimated.
CLOct 18, 2023
Measuring Pointwise $\mathcal{V}$-Usable Information In-Context-lySheng Lu, Shan Chen, Yingya Li et al. · harvard
In-context learning (ICL) is a new learning paradigm that has gained popularity along with the development of large language models. In this work, we adapt a recently proposed hardness metric, pointwise $\mathcal{V}$-usable information (PVI), to an in-context version (in-context PVI). Compared to the original PVI, in-context PVI is more efficient in that it requires only a few exemplars and does not require fine-tuning. We conducted a comprehensive empirical analysis to evaluate the reliability of in-context PVI. Our findings indicate that in-context PVI estimates exhibit similar characteristics to the original PVI. Specific to the in-context setting, we show that in-context PVI estimates remain consistent across different exemplar selections and numbers of shots. The variance of in-context PVI estimates across different exemplar selections is insignificant, which suggests that in-context PVI are stable. Furthermore, we demonstrate how in-context PVI can be employed to identify challenging instances. Our work highlights the potential of in-context PVI and provides new insights into the capabilities of ICL.
ETMay 29
Compact and Energy-Efficient Memristive Spiking Neuromorphic Accelerator for Bio-inspired Interception TasksQianhou Qu, Sheng Lu, Sungyong Jung et al.
Spiking neural networks (SNNs) provide an efficient event-driven computing paradigm for bio-inspired interception tasks. However, most implementations rely on von Neumann digital computing platforms, where memory and computation bottlenecks limit energy efficiency. This work presents a compact and energy-efficient memristive neuromorphic accelerator for bio-inspired interception tasks. A novel one-transistor-one-resistor (1T1R) crossbar array is designed to emulate synaptic operations in the in-memory computing (IMC) domain, while circuit-level optimization mitigates membrane drift and improves integration fidelity. In addition, an integrate-and-fire (IF) neuron with separated input and membrane nodes is developed to improve inference robustness during array-interfaced operation. Implemented in the SkyWater SKY130 PDK, the proposed neuron achieves an energy consumption of 10.67 pJ/spike and an area of 906 um^2. System-level results show that the memristive IMC output closely matches the software SNN baseline, with a correlation coefficient of 0.9622, while achieving a 96% interception success rate. These results demonstrate the effectiveness of the proposed design for compact and reliable memristive SNN inference in bio-inspired interception tasks.
NEMay 29
Memristor-Based Spiking Neural Network Accelerator for Bio-inspired Interception TaskQianhou Qu, Sheng Lu, Liuting Shang et al.
Spiking neural networks (SNNs) provide event-driven and low-power computation inspired by biological neural systems, but current implementations rely on von Neumann graphics processing units (GPUs) and central processing units (CPUs) platforms, where memory and computation bottlenecks limit energy efficiency. To address this challenge, this paper proposes an analog memristor-based spiking neural network (SNN) accelerator that integrates in-memory synaptic computation with analog integrate-and-fire (IF) neurons, eliminating multi-transistor CMOS synapse circuits and enabling asynchronous event-driven operation at the 45nm technology node. Additionally, a digital SNN accelerator is designed and optimized at the 5 nm technology node for comparison. The proposed architecture is evaluated using a predator-prey tracking task that emulates pursuit behavior. In this task, the analog SNN accelerator's inference closely matches the ideal software inference with a mean squared error (MSE) of 0.004. HSPICE simulation results show that the proposed analog SNN accelerator achieves 12.7 times lower energy consumption and 1.26 times lower delay compared to the digital baseline, demonstrating the potential of memristor-based neuromorphic circuits for energy-efficient real-time edge intelligence.
CLNov 13, 2023
How are Prompts Different in Terms of Sensitivity?Sheng Lu, Hendrik Schuff, Iryna Gurevych
In-context learning (ICL) has become one of the most popular learning paradigms. While there is a growing body of literature focusing on prompt engineering, there is a lack of systematic analysis comparing the effects of prompts across different models and tasks. To address this gap, we present a comprehensive prompt analysis based on the sensitivity of a function. Our analysis reveals that sensitivity is an unsupervised proxy for model performance, as it exhibits a strong negative correlation with accuracy. We use gradient-based saliency scores to empirically demonstrate how different prompts affect the relevance of input tokens to the output, resulting in different levels of sensitivity. Furthermore, we introduce sensitivity-aware decoding which incorporates sensitivity estimation as a penalty term in the standard greedy decoding. We show that this approach is particularly helpful when information in the input is scarce. Our work provides a fresh perspective on the analysis of prompts, and contributes to a better understanding of the mechanism of ICL.
CVDec 30, 2025Code
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video GenerationZhe Huang, Hao Wen, Aiming Hao et al.
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
CVMar 19
Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer AnalysisSheng Lu, Hao Chen, Rui Yin et al.
Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
CVFeb 21Code
TAG: Thinking with Action Unit Grounding for Facial Expression RecognitionHaobo Lin, Tianyi Bai, Jiajun Zhang et al.
Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision--language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision--language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG .
CLMay 10, 2024
What Can Natural Language Processing Do for Peer Review?Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen et al.
The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.
CLMay 13, 2025
Towards Contamination Resistant BenchmarksRahmatullah Musawi, Sheng Lu
The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions regarding their true capabilities. Our work contributes to the development of contamination resistant benchmarks, enabling more rigorous LLM evaluation and offering insights into the true capabilities and limitations of LLMs.
CLApr 9, 2025
Identifying Aspects in Peer ReviewsSheng Lu, Ilia Kuznetsov, Iryna Gurevych
Peer review is central to academic publishing, but the growing volume of submissions is straining the process. This motivates the development of computational approaches to support peer review. While each review is tailored to a specific paper, reviewers often make assessments according to certain aspects such as Novelty, which reflect the values of the research community. This alignment creates opportunities for standardizing the reviewing process, improving quality control, and enabling computational support. While prior work has demonstrated the potential of aspect analysis for peer review assistance, the notion of aspect remains poorly formalized. Existing approaches often derive aspects from review forms and guidelines, yet data-driven methods for aspect identification are underexplored. To address this gap, our work takes a bottom-up approach: we propose an operational definition of aspect and develop a data-driven schema for deriving aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis. We further show how the choice of aspects can impact downstream applications, such as LLM-generated review detection. Our results lay a foundation for a principled and data-driven investigation of review aspects, and pave the path for new applications of NLP to support peer review.