CL · Oct 11, 2022 · Code
A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models
Yuanxin Liu, Fandong Meng, Zheng Lin et al. · pku, tsinghua
Despite the remarkable success of pre-trained language models (PLMs), they still face two challenges: First, large-scale PLMs are inefficient in terms of memory footprint and computation. Second, on downstream tasks, PLMs tend to rely on dataset bias and struggle to generalize to out-of-distribution (OOD) data. In response to the efficiency problem, recent studies show that dense PLMs can be replaced with sparse subnetworks without hurting performance. Such subnetworks can be found in three scenarios: 1) in fine-tuned PLMs, 2) in raw PLMs, with the subnetworks then fine-tuned in isolation, and even 3) in PLMs without any parameter fine-tuning. However, these results are only obtained in the in-distribution (ID) setting. In this paper, we extend the study of PLM subnetworks to the OOD setting, investigating whether sparsity and robustness to dataset bias can be achieved simultaneously. To this end, we conduct extensive experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can consistently be found in BERT, across the aforementioned three scenarios, using different training and compression methods. Furthermore, we explore the upper bound of SRNets using the OOD information and show that there exist sparse and almost unbiased BERT subnetworks. Finally, we present 1) an analytical study that provides insights on how to promote the efficiency of the SRNets searching process and 2) a solution to improve subnetworks' performance at high sparsity. The code is available at https://github.com/llyx97/sparse-and-robust-PLM.
CL · May 2, 2022
Neutral Utterances are Also Causes: Enhancing Conversational Causal Emotion Entailment with Social Commonsense Knowledge
Jiangnan Li, Fandong Meng, Zheng Lin et al. · tsinghua
Conversational Causal Emotion Entailment aims to detect causal utterances for a non-neutral targeted utterance from a conversation. In this work, we build conversations as graphs to overcome implicit contextual modelling of the original entailment style. Following the previous work, we further introduce the emotion information into graphs. Emotion information can markedly promote the detection of causal utterances whose emotion is the same as the targeted utterance. However, it is still hard to detect causal utterances with different emotions, especially neutral ones. The reason is that models are limited in reasoning causal clues and passing them between utterances. To alleviate this problem, we introduce social commonsense knowledge (CSK) and propose a Knowledge Enhanced Conversation graph (KEC). KEC propagates the CSK between two utterances. As not all CSK is emotionally suitable for utterances, we therefore propose a sentiment-realized knowledge selecting strategy to filter CSK. To process KEC, we further construct the Knowledge Enhanced Directed Acyclic Graph networks. Experimental results show that our method outperforms baselines and infers more causes with different emotions from the targeted utterance.
CL · Oct 21, 2022
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible Knowledge Selection
Lanrui Wang, Jiangnan Li, Zheng Lin et al. · tsinghua
Empathy, which is widely used in psychological counselling, is a key trait of everyday human conversations. Equipped with commonsense knowledge, current approaches to empathetic response generation focus on capturing implicit emotion within the dialogue context, where emotions are treated as a static variable throughout the conversation. However, emotions change dynamically between utterances, which makes it difficult for previous works to perceive the emotion flow and predict the correct emotion of the target response, leading to inappropriate responses. Furthermore, simply importing commonsense knowledge without harmonization may trigger conflicts between knowledge and emotion, which confuse the model into choosing incorrect information to guide the generation process. To address the above problems, we propose a Serial Encoding and Emotion-Knowledge interaction (SEEK) method for empathetic dialogue generation. We use a fine-grained encoding strategy that is more sensitive to the emotion dynamics (emotion flow) in conversations to predict the emotion-intent characteristic of the response. Besides, we design a novel framework to model the interaction between knowledge and emotion to generate more sensible responses. Extensive experiments on EmpatheticDialogues demonstrate that SEEK outperforms strong baselines in both automatic and manual evaluations.
CL · Oct 26, 2022
Question-Interlocutor Scope Realized Graph Modeling over Key Utterances for Dialogue Reading Comprehension
Jiangnan Li, Mo Yu, Fandong Meng et al. · ibm-research, tsinghua
In this work, we focus on dialogue reading comprehension (DRC), a task extracting answer spans for questions from dialogues. Dialogue context modeling in DRC is tricky due to complex speaker information and noisy dialogue context. To solve the two problems, previous research proposes two self-supervised tasks respectively: guessing who a randomly masked speaker is according to the dialogue and predicting which utterance in the dialogue contains the answer. Although these tasks are effective, pressing problems remain: (1) randomly masking speakers regardless of the question cannot map the speaker mentioned in the question to the corresponding speaker in the dialogue, and ignores the speaker-centric nature of utterances. This leads to wrong answer extraction from utterances in unrelated interlocutors' scopes; (2) single-utterance prediction, which prefers utterances similar to the question, is limited in finding answer-containing utterances that are not similar to the question. To alleviate these problems, we first propose a new key utterance extraction method. It performs prediction on units formed by several contiguous utterances, which can capture more answer-containing utterances. Based on utterances in the extracted units, we then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling. As a graph constructed on the text of utterances, QuISG additionally involves the question and question-mentioning speaker names as nodes. To realize interlocutor scopes, speakers in the dialogue are connected with the words in their corresponding utterances. Experiments on the benchmarks show that our method achieves results that are better than or competitive with previous works.
CV · Aug 31, 2023 · Code
Separate and Locate: Rethink the Text in Text-based Visual Question Answering
Chengyang Fang, Jiangnan Li, Liang Li et al.
Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantic contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable. The 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, but not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets. Compared with the state-of-the-art pre-training method, which is pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvement on TextVQA and ST-VQA. Our code and models will be released at https://github.com/fangbufang/SaL.
IR · Nov 3, 2023
Plot Retrieval as an Assessment of Abstract Semantic Association
Shicheng Xu, Liang Pang, Jiangnan Li et al.
Retrieving relevant plots from the book for a query is a critical task, which can improve the reading experience and efficiency of readers. Readers usually only give an abstract and vague description as the query based on their own understanding, summaries, or speculations of the plot, which requires the retrieval model to have a strong ability to estimate the abstract semantic associations between the query and candidate plots. However, existing information retrieval (IR) datasets cannot reflect this ability well. In this paper, we propose Plot Retrieval, a labeled dataset to train and evaluate the performance of IR models on the novel task Plot Retrieval. Text pairs in Plot Retrieval have less word overlap and more abstract semantic association, which can reflect the ability of the IR models to estimate the abstract semantic association, rather than just traditional lexical or semantic matching. Extensive experiments across various lexical retrieval, sparse retrieval, dense retrieval, and cross-encoder methods compared with human studies on Plot Retrieval show current IR models still struggle in capturing abstract semantic association between texts. Plot Retrieval can be the benchmark for further research on the semantic association modeling ability of IR models.
CL · Apr 21
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Junjie Wu, Jiangnan Li, Yuqing Li et al.
Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
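To make the idea of a "situated" chunk concrete, here is a minimal sketch of how short chunks might be paired with a broader context window before encoding. It only illustrates the data layout described in the abstract; the function name `situate_chunks` and the parameters `chunk_size` and `context_radius` are hypothetical, and the actual SitEmb training and encoding procedure is not shown.

```python
from typing import List, Tuple


def situate_chunks(
    sentences: List[str], chunk_size: int = 2, context_radius: int = 2
) -> List[Tuple[str, str]]:
    """Pair each short chunk with a wider window of surrounding sentences.

    Hypothetical helper: the chunk is the unit to be retrieved, while the
    context is the extra text an encoder could condition on.
    """
    pairs = []
    for start in range(0, len(sentences), chunk_size):
        end = min(start + chunk_size, len(sentences))
        chunk = " ".join(sentences[start:end])
        # Extend the window on both sides, clipped to the document bounds.
        lo = max(0, start - context_radius)
        hi = min(len(sentences), end + context_radius)
        context = " ".join(sentences[lo:hi])
        pairs.append((chunk, context))
    return pairs


if __name__ == "__main__":
    doc = ["Anna left home.", "She met Tom.", "He lied to her.", "She forgave him."]
    for chunk, context in situate_chunks(doc, chunk_size=1, context_radius=1):
        print(f"chunk={chunk!r} | context={context!r}")
```

The retrieval unit stays short (so localized evidence can still be returned), while the second element of each pair carries the surrounding text that gives the chunk its meaning.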
CL · Nov 26, 2023
Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference
Lanrui Wang, Jiangnan Li, Chenxu Yang et al.
Recently, there has been a heightened interest in building chatbots based on Large Language Models (LLMs) to emulate human-like qualities in multi-turn conversations. Despite having access to commonsense knowledge to better understand the psychological aspects and causality of dialogue context, even these powerful LLMs struggle to achieve the goals of empathy and emotional support. Current commonsense knowledge derived from dialogue contexts is inherently limited and often fails to adequately anticipate the future course of a dialogue. This lack of foresight can mislead LLMs and hinder their ability to provide effective support. In response to this challenge, we present an innovative framework named Sensible and Visionary Commonsense Knowledge (Sibyl). Designed to concentrate on the immediately succeeding dialogue, this paradigm equips LLMs with the capability to uncover the implicit requirements of the conversation, aiming to elicit more empathetic responses. Experimental results demonstrate that incorporating our paradigm for acquiring commonsense knowledge into LLMs comprehensively enhances the quality of their responses.
CL · Oct 13, 2023
Multi-level Adaptive Contrastive Learning for Knowledge Internalization in Dialogue Generation
Chenxu Yang, Zheng Lin, Lanrui Wang et al.
Knowledge-grounded dialogue generation aims to mitigate the issue of text degeneration by incorporating external knowledge to supplement the context. However, the model often fails to internalize this information into responses in a human-like manner. Instead, it simply inserts segments of the provided knowledge into generic responses. As a result, the generated responses tend to be tedious, incoherent, and lacking in interactivity, which means the degeneration problem is still unsolved. In this work, we first find that such copying-style degeneration is primarily due to the weak likelihood objective, which allows the model to "cheat" the objective by merely duplicating knowledge segments through superficial pattern matching based on overlap. To overcome this challenge, we then propose a Multi-level Adaptive Contrastive Learning (MACL) framework that dynamically samples negative examples and subsequently penalizes degeneration behaviors at both the token level and the sequence level. Extensive experiments on the WoW dataset demonstrate the effectiveness of our approach across various pre-trained models.
CL · Dec 19, 2025
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
Yuqing Li, Jiangnan Li, Zheng Lin et al.
Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
CL · Feb 12
Query-focused and Memory-aware Reranker for Long Context Processing
Yuqing Li, Jiangnan Li, Mo Yu et al.
Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
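As a rough illustration of scoring passages from attention, the sketch below sums the attention that query tokens place on each passage's token span, averaged over a set of selected heads, and ranks the whole shortlist at once. This is an assumption-laden toy, not the paper's method: the input layout (`attn[head][query_token][position]`), the averaging rule, and the names `attention_relevance` and `rerank` are all hypothetical, and no model or training is involved.

```python
from typing import Dict, List, Sequence, Tuple


def attention_relevance(
    attn: Sequence[Sequence[Sequence[float]]],
    selected_heads: Sequence[int],
    passage_spans: Dict[str, Tuple[int, int]],
) -> Dict[str, float]:
    """Score each passage by the attention mass it receives from the query.

    attn[h][q][p] is assumed to hold the attention weight from query token q
    to context position p in head h. passage_spans maps a passage id to its
    half-open [start, end) range of context positions.
    """
    scores = {}
    for pid, (start, end) in passage_spans.items():
        total = 0.0
        rows = 0
        for h in selected_heads:
            for query_row in attn[h]:
                total += sum(query_row[start:end])
                rows += 1
        # Average over (head, query token) pairs to get a continuous score.
        scores[pid] = total / rows if rows else 0.0
    return scores


def rerank(scores: Dict[str, float]) -> List[str]:
    """Order passage ids from most to least relevant."""
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


if __name__ == "__main__":
    # One head, one query token, four context positions (two per passage).
    toy_attn = [[[0.1, 0.1, 0.5, 0.3]]]
    spans = {"p1": (0, 2), "p2": (2, 4)}
    print(rerank(attention_relevance(toy_attn, [0], spans)))
```

Because every candidate sits in the same context, each score is computed relative to the full shortlist, which is what makes the approach listwise; the scores are continuous rather than Likert-style labels.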
CL · Jan 22
ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection
Guoxuan Ding, Yuqing Li, Ziyan Zhou et al.
The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
CV · Apr 20
Dynamic Visual-semantic Alignment for Zero-shot Learning with Ambiguous Labels
Jiangnan Li, Linqing Huang, Xiaowen Yan et al.
Zero-shot learning (ZSL) aims to recognize unseen classes without visual instances. However, existing methods usually assume clean labels, overlooking real-world label noise and ambiguity, which degrades performance. To bridge this gap, we propose the Dynamic Visual-semantic Alignment (DVSA), a robust ZSL framework for learning from ambiguous labels. DVSA uses a bidirectional visual-semantic alignment module with attention to mutually calibrate visual features and attribute prototypes, and a contrastive optimization grounded in Mutual Information (MI) at the attribute level to strengthen discriminative, semantically consistent attributes. In addition, a dynamic label disambiguation mechanism iteratively corrects noisy supervision while preserving semantic consistency, narrowing the instance-label gap, and improving generalization. Extensive experiments on standard benchmarks verify that DVSA achieves stronger performance under ambiguous supervision.
CL · Jun 10, 2025 · Code
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings
Liyan Xu, Zhenlin Su, Mo Yu et al.
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with such fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
CL · May 17, 2023 · Code
Personality Understanding of Fictional Characters during Book Reading
Mo Yu, Jiangnan Li, Shunyu Yao et al.
Comprehending characters' personalities is a crucial aspect of story reading. As readers engage with a story, their understanding of a character evolves based on new events and information; and multiple fine-grained aspects of personalities can be perceived. This leads to a natural problem of situated and fine-grained personality understanding. The problem has not been studied in the NLP field, primarily due to the lack of appropriate datasets mimicking the process of book reading. We present the first labeled dataset PersoNet for this problem. Our novel annotation strategy involves annotating user notes from online reading apps as a proxy for the original books. Experiments and human studies indicate that our dataset construction is both efficient and accurate; and our task heavily relies on long-term context to achieve accurate predictions for both machines and humans. The dataset is available at https://github.com/Gorov/personet_acl23.
CL · May 7
MiA-Signature: Approximating Global Activation for Long-Context Understanding
Yuqing Li, Jiangnan Li, Mo Yu et al.
A growing body of work in cognitive science suggests that reportable conscious access is associated with global ignition over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumerate all activated contents. This tension suggests a plausible mechanism that cognition may rely on a compact representation that approximates the global influence of activation on downstream processing. Inspired by this idea, we introduce the concept of Mindscape Activation Signature (MiA-Signature), a compressed representation of the global activation pattern induced by a query. In LLM systems, this is instantiated via submodular-based selection of high-level concepts that cover the activated context space, optionally refined through lightweight iterative updates using working memory. The resulting MiA-Signature serves as a conditioning signal that approximates the effect of the full activation state while remaining computationally tractable. Integrating MiA-Signatures into both RAG and agentic systems yields consistent performance gains across multiple long-context understanding tasks.
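The "submodular-based selection of high-level concepts that cover the activated context space" can be pictured with the classic greedy algorithm for coverage, a standard submodular objective. The sketch below is only that textbook routine under assumed inputs: each concept is mapped to the set of context items it covers, and the names `greedy_cover`, `concepts`, and `budget` are hypothetical. It does not reproduce the paper's objective or its iterative refinement step.

```python
from typing import Dict, List, Set


def greedy_cover(concepts: Dict[str, Set[int]], budget: int) -> List[str]:
    """Greedily pick up to `budget` concepts that maximize total coverage.

    concepts maps a concept name to the ids of the activated context items
    it covers (an assumed representation). Ties are broken alphabetically
    so the result is deterministic.
    """
    covered: Set[int] = set()
    chosen: List[str] = []
    for _ in range(budget):
        best_name = None
        best_gain = 0
        for name in sorted(concepts):
            if name in chosen:
                continue
            # Marginal gain: items this concept adds beyond what is covered.
            gain = len(concepts[name] - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # No remaining concept adds new coverage.
        chosen.append(best_name)
        covered |= concepts[best_name]
    return chosen


if __name__ == "__main__":
    activated = {
        "war": {1, 2, 3},
        "family": {3, 4},
        "journey": {5},
        "betrayal": {1, 2},
    }
    print(greedy_cover(activated, budget=2))
```

The selected names form a compact stand-in for the full set of activated items, which is the role the abstract assigns to the signature as a conditioning signal.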
CL · Feb 21, 2024
Fine-Grained Modeling of Narrative Context: A Coherence Perspective via Retrospective Questions
Liyan Xu, Jiangnan Li, Mo Yu et al.
This work introduces an original and practical paradigm for narrative comprehension, stemming from the characteristics that individual passages within narratives tend to be more cohesively related than isolated. Complementary to the common end-to-end paradigm, we propose a fine-grained modeling of narrative context, by formulating a graph dubbed NarCo, which explicitly depicts task-agnostic coherence dependencies that are ready to be consumed by various downstream tasks. In particular, edges in NarCo encompass free-form retrospective questions between context snippets, inspired by human cognitive perception that constantly reinstates relevant events from prior context. Importantly, our graph formalism is practically instantiated by LLMs without human annotations, through our designed two-stage prompting scheme. To examine the graph properties and its utility, we conduct three studies in narratives, each from a unique angle: edge relation efficacy, local context enrichment, and broader application in QA. All tasks could benefit from the explicit coherence captured by NarCo.
CL · Feb 13, 2025
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
Mo Yu, Lemao Liu, Junjie Wu et al.
In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.
CL · Dec 22, 2023
SIG: Speaker Identification in Literature via Prompt-Based Generation
Zhenlin Su, Liyan Xu, Jin Xu et al.
Identifying speakers of quotations in narratives is an important task in literary analysis, with challenging scenarios including out-of-domain inference for unseen speakers, and non-explicit cases where there are no speaker mentions in the surrounding context. In this work, we propose a simple and effective approach, SIG, a generation-based method that verbalizes the task and quotation input based on designed prompt templates, which also enables easy integration of other auxiliary tasks that further bolster the speaker identification performance. The prediction can either come from direct generation by the model, or be determined by the highest generation probability of each speaker candidate. Based on our approach design, SIG supports out-of-domain evaluation, and achieves an open-world classification paradigm that is able to accept any form of candidate input. We perform both cross-domain evaluation and in-domain evaluation on PDNC, the largest dataset of this task, where empirical results suggest that SIG outperforms previous baselines of complicated designs, as well as the zero-shot ChatGPT, especially excelling at those hard non-explicit scenarios by up to 17% improvement. Additional experiments on another dataset, WP, further corroborate the efficacy of SIG.
CL · Feb 11, 2024
Previously on the Stories: Recap Snippet Identification for Story Reading
Jiangnan Li, Qiujing Wang, Liyan Xu et al.
Similar to the "previously-on" scenes in TV shows, recaps can help book reading by recalling the readers' memory about the important elements in previous texts to better understand the ongoing plot. Despite its usefulness, this application has not been well studied in the NLP community. We propose the first benchmark on this useful task called Recap Snippet Identification with a hand-crafted evaluation dataset. Our experiments show that the proposed task is challenging to PLMs, LLMs, and proposed methods as the task requires a deep understanding of the plot correlation between snippets.
CL · Jan 3, 2025
The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters
Chulun Zhou, Qiujing Wang, Mo Yu et al.
Theory-of-Mind (ToM) is a fundamental psychological capability that allows humans to understand and interpret the mental states of others. Humans infer others' thoughts by integrating causal cues and indirect clues from broad contextual information, often derived from past interactions. In other words, human ToM heavily relies on understanding the backgrounds and life stories of others. Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines' ToM capabilities, due to their usage of short narratives without global context, especially the personal backgrounds of characters. In this paper, we verify the importance of comprehensive contextual understanding of personal backgrounds in ToM and assess the performance of LLMs in such complex scenarios. To achieve this, we introduce the CharToM benchmark, comprising 1,035 ToM questions based on characters from classic novels. Our human study reveals a significant disparity in performance: the same group of educated participants performs dramatically better when they have read the novels compared to when they have not. In parallel, our experiments on state-of-the-art LLMs, including the very recent o1 and DeepSeek-R1 models, show that LLMs still perform notably worse than humans, even though they have seen these stories during pre-training. This highlights the limitations of current LLMs in capturing the nuanced contextual information required for ToM reasoning.
CL · Aug 13, 2025
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
Mo Yu, Tsz Ting Chung, Chulun Zhou et al.
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
CV · Mar 5
CLIP-driven Zero-shot Learning with Ambiguous Labels
Jinfu Fan, Jiangnan Li, Xiaowen Yan et al.
Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As the training goes on, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
LG · Oct 2, 2025
Beyond Imitation: Recovering Dense Rewards from Demonstrations
Jiangnan Li, Thuy-Trang Vu, Ehsan Abbasnejad et al.
Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
LGJul 1, 2025
Diffusion Disambiguation Models for Partial Label LearningJinfu Fan, Xiaohui Zhong, Kangrui Ren et al.
Learning from ambiguous labels is a long-standing problem in practical machine learning applications. The purpose of partial label learning (PLL) is to identify the ground-truth label from a set of candidate labels associated with a given instance. Inspired by the remarkable performance of diffusion models in various generation tasks, this paper explores their potential to denoise ambiguous labels through the reverse denoising process. Therefore, this paper reformulates the label disambiguation problem from the perspective of generative models, where labels are generated by iteratively refining initial random guesses. This perspective enables the diffusion model to learn how label information is generated stochastically. By modeling the generation uncertainty, we can use the maximum likelihood estimate of the label for classification inference. However, such ambiguous labels lead to a mismatch between instance and label, which reduces the quality of generated data. To address this issue, this paper proposes a diffusion disambiguation model for PLL (DDMP), which first uses the potential complementary information between instances and labels to construct pseudo-clean labels for initial diffusion training. Furthermore, a transition-aware matrix is introduced to estimate the potential ground-truth labels, which are dynamically updated during the diffusion generation. During training, the ground-truth label is progressively refined, improving the classifier. Experiments show the advantage of the DDMP and its suitability for PLL.
CLMar 31, 2025
CONGRAD: Conflicting Gradient Filtering for Multilingual Preference AlignmentJiangnan Li, Thuy-Trang Vu, Christian Herold et al.
Naive joint training of large language models (LLMs) for multilingual preference alignment can suffer from negative interference. This is a known issue in multilingual training, where conflicting objectives degrade overall performance. However, the impact of this phenomenon in the context of multilingual preference alignment remains largely underexplored. To address this issue, we propose CONGRAD, a scalable and effective filtering method that selects high-quality preference samples with minimal gradient conflicts across languages. Our method leverages gradient surgery to retain samples aligned with an aggregated multilingual update direction. Additionally, we incorporate a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. We integrate CONGRAD into a self-rewarding framework and evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that CONGRAD consistently outperforms strong baselines in both seen and unseen languages, with minimal alignment tax.
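The filtering idea can be sketched as follows: keep only the samples whose gradient points in roughly the same direction as an aggregated update. This is a minimal sketch, not the CONGRAD procedure itself. It assumes the aggregated direction is the plain mean of the per-sample gradients and that "aligned" means positive cosine similarity; the function name `filter_by_gradient_agreement`, the `threshold` parameter, and the toy gradients are hypothetical, and the gradient surgery and compression steps are omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def filter_by_gradient_agreement(sample_grads, threshold=0.0):
    """Return indices of samples whose gradient agrees with the mean gradient.

    Agreement is measured by cosine similarity; samples at or below
    `threshold` are treated as conflicting and dropped.
    """
    dim = len(sample_grads[0])
    mean_grad = [sum(g[i] for g in sample_grads) / len(sample_grads) for i in range(dim)]
    mean_norm = math.sqrt(dot(mean_grad, mean_grad))
    kept = []
    for idx, grad in enumerate(sample_grads):
        grad_norm = math.sqrt(dot(grad, grad))
        if grad_norm == 0.0 or mean_norm == 0.0:
            continue  # no usable direction to compare against
        if dot(grad, mean_grad) / (grad_norm * mean_norm) > threshold:
            kept.append(idx)
    return kept

# Four toy per-sample gradients; the third one opposes the others.
grads = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.1], [0.9, -0.1]]
kept = filter_by_gradient_agreement(grads)
```

In this toy case the third sample has a negative dot product with the mean direction and is filtered out, while the other three are retained.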
LGOct 19, 2024
NeuralMAG: Fast and Generalizable Micromagnetic Simulation with Deep Neural NetsYunqi Cai, Jiangnan Li, Dong Wang
Micromagnetics has made significant strides, particularly due to its wide-ranging applications in magnetic storage design. Numerical simulation is a cornerstone of micromagnetics research, relying on first-principle rules to compute the dynamic evolution of micromagnetic systems based on the renowned LLG equation, named after Landau, Lifshitz, and Gilbert. However, simulations are often hindered by their slow speed. Although fast Fourier transform (FFT) calculations reduce the computational complexity to O(N log N), this remains impractical for large-scale simulations. In this paper, we introduce NeuralMAG, a deep learning approach to micromagnetic simulation. Our approach follows the LLG iterative framework but accelerates demagnetizing field computation by employing a U-shaped neural network (Unet). The Unet architecture comprises an encoder that extracts aggregated spins at various scales and learns the local interaction at each scale, followed by a decoder that accumulates the local interactions at different scales to approximate the global convolution. This divide-and-accumulate scheme achieves a time complexity of O(N), significantly enhancing the speed and feasibility of large-scale simulations. Unlike existing neural methods, NeuralMAG concentrates on the core computation rather than an end-to-end approximation for a specific task, making it inherently generalizable. To validate the new approach, we trained a single model and evaluated it on two micromagnetics tasks with various sample sizes, shapes, and material settings.
CLJun 7, 2024
Think out Loud: Emotion Deducing Explanation in DialoguesJiangnan Li, Zheng Lin, Lanrui Wang et al.
Humans convey emotions through daily dialogues, making emotion understanding a crucial step of affective intelligence. To understand emotions in dialogues, machines are asked to recognize the emotion of an utterance (Emotion Recognition in Dialogues, ERD) and then, based on that emotion, to find the utterances that caused it (Emotion Cause Extraction in Dialogues, ECED). This setting requires performing ERD before ECED, ignoring the complementarity between emotion and cause. To fix this, some new tasks have been proposed to extract them simultaneously. Although research on these tasks has achieved strong results, simply identifying emotion-related factors through classification does not capture, in an explainable way, the specific thinking process by which causes stimulate the emotion. This thinking process, reflected especially in the reasoning ability of Large Language Models (LLMs), is under-explored. To this end, we propose a new task "Emotion Deducing Explanation in Dialogues" (EDEN). EDEN recognizes emotions and causes through explicit reasoning. That is, models need to generate an explanation text, which first summarizes the causes, then uses common sense to analyze the inner activities of the speakers triggered by those causes, and finally guesses the emotion accordingly. To support the study of EDEN, based on the existing resources in ECED, we construct two EDEN datasets by human effort. We further evaluate different models on EDEN and find that LLMs are more competent than conventional PLMs. Besides, EDEN can help LLMs achieve better recognition of emotions and causes, thereby exploring a new research direction of explainable emotion understanding in dialogues.
CRFeb 17, 2021
Towards Adversarial-Resilient Deep Neural Networks for False Data Injection Attack Detection in Power GridsJiangnan Li, Yingyuan Yang, Jinyuan Stella Sun et al.
False data injection attacks (FDIAs) pose a significant security threat to power system state estimation. To detect such attacks, recent studies have proposed machine learning (ML) techniques, particularly deep neural networks (DNNs). However, most of these methods fail to account for the risk posed by adversarial measurements, which can compromise the reliability of DNNs in various ML applications. In this paper, we present a DNN-based FDIA detection approach that is resilient to adversarial attacks. We first analyze several adversarial defense mechanisms used in computer vision and show their inherent limitations in FDIA detection. We then propose an adversarial-resilient DNN detection framework for FDIA that incorporates random input padding in both the training and inference phases. Our simulations, based on an IEEE standard power system, demonstrate that this framework significantly reduces the effectiveness of adversarial attacks while having a negligible impact on the DNNs' detection performance.
CLDec 29, 2020
A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in ConversationJiangnan Li, Zheng Lin, Peng Fu et al.
Emotion Recognition in Conversation (ERC) is a more challenging task than conventional text emotion recognition. It can be regarded as a personalized and interactive emotion recognition task, which is supposed to consider not only the semantic information of text but also the influences from speakers. The current method models speakers' interactions by building a relation between every two speakers. However, this fine-grained but complicated modeling is computationally expensive, hard to extend, and can only consider local context. To address this problem, we simplify the complicated modeling to a binary version: Intra-Speaker and Inter-Speaker dependencies, without identifying every unique speaker for the targeted speaker. To better achieve the simplified interaction modeling of speakers in Transformer, which excels at handling long-distance dependencies, we design three types of masks and utilize them in three independent Transformer blocks, one mask per block. The designed masks respectively implement conventional context modeling, Intra-Speaker dependency, and Inter-Speaker dependency. Furthermore, the different speaker-aware information extracted by the Transformer blocks contributes differently to the prediction, and therefore we utilize the attention mechanism to automatically weight them. Experiments on two ERC datasets indicate that our model achieves better performance.
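The three masks can be illustrated with a small sketch that builds them from a list of per-utterance speaker ids. The abstract does not specify the exact mask layout, so this is an assumed construction: 1 means "may attend", the conventional mask is fully open, the Intra-Speaker mask links utterances by the same speaker, and the Inter-Speaker mask links utterances by different speakers. Keeping the diagonal open in the Inter-Speaker mask (so that no row is entirely masked) and the function name `build_masks` are choices made here for illustration.

```python
def build_masks(speakers):
    """Build three n x n attention masks (1 = may attend) from speaker ids."""
    n = len(speakers)
    # Conventional context modeling: every utterance sees every utterance.
    conventional = [[1] * n for _ in range(n)]
    # Intra-Speaker: attend only to utterances from the same speaker.
    intra = [[1 if speakers[i] == speakers[j] else 0 for j in range(n)]
             for i in range(n)]
    # Inter-Speaker: attend to other speakers' utterances, plus the utterance itself.
    inter = [[1 if speakers[i] != speakers[j] or i == j else 0 for j in range(n)]
             for i in range(n)]
    return conventional, intra, inter

# A three-turn dialogue: speaker A, then B, then A again.
conventional, intra, inter = build_masks(["A", "B", "A"])
```

Because the masks depend only on whether two utterances share a speaker, the construction stays the same regardless of how many distinct speakers take part in the conversation.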
CLDec 3, 2020
Learning Class-Transductive Intent Representations for Zero-shot Intent DetectionQingyi Si, Yuanxin Liu, Peng Fu et al.
Zero-shot intent detection (ZSID) aims to deal with the continuously emerging intents without annotated training data. However, existing ZSID systems suffer from two limitations: 1) They are not good at modeling the relationship between seen and unseen intents. 2) They cannot effectively recognize unseen intents under the generalized intent detection (GZSID) setting. A critical problem behind these limitations is that the representations of unseen intents cannot be learned in the training stage. To address this problem, we propose a novel framework that utilizes unseen class labels to learn Class-Transductive Intent Representations (CTIR). Specifically, we allow the model to predict unseen intents during training, with the corresponding label names serving as input utterances. On this basis, we introduce a multi-task learning objective, which encourages the model to learn the distinctions among intents, and a similarity scorer, which estimates the connections among intents more accurately. CTIR is easy to implement and can be integrated with existing methods. Experiments on two real-world datasets show that CTIR brings considerable improvement to the baseline systems.
CROct 16, 2020
Exploiting Vulnerabilities of Deep Learning-based Energy Theft Detection in AMI through Adversarial AttacksJiangnan Li, Yingyuan Yang, Jinyuan Stella Sun
Effective detection of energy theft can prevent revenue losses of utility companies and is also important for smart grid security. In recent years, enabled by the massive fine-grained smart meter data, deep learning (DL) approaches are becoming popular in the literature to detect energy theft in the advanced metering infrastructure (AMI). However, as neural networks are shown to be vulnerable to adversarial examples, the security of the DL models is of concern. In this work, we study the vulnerabilities of DL-based energy theft detection through adversarial attacks, including single-step attacks and iterative attacks. From the attacker's point of view, we design the SearchFromFree framework that consists of 1) a random adversarial measurement initialization approach to maximize the stolen profit and 2) a step-size searching scheme to increase the performance of black-box iterative attacks. The evaluation based on three types of neural networks shows that the adversarial attacker can report extremely low consumption measurements to the utility without being detected by the DL models. We finally discuss the potential defense mechanisms against adversarial attacks in energy theft detection.
CRJun 15, 2020
BubbleMap: Privilege Mapping for Behavior-based Implicit Authentication SystemsYingyuan Yang, Xueli Huang, Jiangnan Li et al.
Leveraging users' behavioral data sampled by various sensors during the identification process, implicit authentication (IA) relieves users from explicit actions such as remembering and entering passwords. Various IA schemes have been proposed based on different behavioral and contextual features such as gait, touch, and GPS. However, existing IA schemes suffer from false positives, i.e., falsely accepting an adversary, and false negatives, i.e., falsely rejecting the legitimate user due to users' behavior change and noise. To deal with this problem, we propose BubbleMap (BMap), a framework that can be seamlessly incorporated into any existing IA system to balance between security (reducing false positives) and usability (reducing false negatives) as well as reducing the equal error rate (EER). To evaluate the proposed framework, we implemented BMap on five state-of-the-art IA systems. We also conducted an experiment in a real-world environment from 2016 to 2020. Most of the experimental results show that BMap can greatly enhance the IA schemes' performances in terms of the EER, security, and usability, with a small amount of penalty on energy consumption.
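For readers unfamiliar with the equal error rate, the sketch below shows one common way to estimate it from authentication scores: sweep a decision threshold and report the point where the false acceptance rate (adversaries accepted) and the false rejection rate (legitimate users rejected) are closest. The function name `equal_error_rate` and the toy scores are hypothetical and unrelated to the BMap experiments.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (threshold, far, frr) at the threshold where |FAR - FRR| is smallest.

    A score >= threshold means "accept". Higher scores indicate the legitimate user.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance rate: impostors whose score clears the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine attempts that fall below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (t, far, frr)
    return best

genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.3, 0.5, 0.6]
threshold, far, frr = equal_error_rate(genuine, impostor)
```

Raising the threshold trades false acceptances for false rejections, which is the security versus usability balance described above.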
CRJun 13, 2020
EchoIA: Implicit Authentication System Based on User FeedbackYingyuan Yang, Xueli Huang, Jiangnan Li et al.
Implicit authentication (IA) transparently authenticates users by utilizing their behavioral data sampled from various sensors. Identifying the illegitimate user through constantly analyzing current users' behavior, IA adds another layer of protection to the smart device. Due to the diversity of human behavior, the existing research works tend to simultaneously utilize many different features to identify users, which is less efficient. Irrelevant features may increase system delay and reduce the authentication accuracy. However, dynamically choosing the most suitable features for each user (personal features) requires massive computation, especially in a real environment. In this paper, we propose EchoIA to find personal features with a small amount of computation by utilizing user feedback. In the authentication phase, our approach maintains transparency, which is the major advantage of IA. Over the past two years, we conducted a comprehensive experiment to evaluate EchoIA. We compared it with other state-of-the-art IA schemes in terms of authentication accuracy and efficiency. The experimental results show that EchoIA has better authentication accuracy (93%) and lower energy consumption (23-hour battery lifetime) than other IA schemes.
SPJun 2, 2020
SearchFromFree: Adversarial Measurements for Machine Learning-based Energy Theft DetectionJiangnan Li, Yingyuan Yang, Jinyuan Stella Sun
Energy theft causes large economic losses to utility companies around the world. In recent years, energy theft detection approaches based on machine learning (ML) techniques, especially neural networks, have become popular in the research literature and achieve state-of-the-art detection performance. However, in this work, we demonstrate that the well-performing ML models for energy theft detection are highly vulnerable to adversarial attacks. In particular, we design an adversarial measurement generation algorithm that enables the attacker to report extremely low power consumption measurements to the utilities while bypassing the ML energy theft detection. We evaluate our approach with three kinds of neural networks based on a real-world smart meter dataset. The evaluation result demonstrates that our approach can significantly decrease the ML models' detection accuracy, even for black-box attackers.
CRMar 12, 2020
ConAML: Constrained Adversarial Machine Learning for Cyber-Physical SystemsJiangnan Li, Yingyuan Yang, Jinyuan Stella Sun et al.
Recent research has demonstrated that superficially well-trained machine learning (ML) models are highly vulnerable to adversarial examples. As ML techniques are becoming a popular solution for cyber-physical systems (CPSs) applications in the research literature, the security of these applications is of concern. However, current studies on adversarial machine learning (AML) mainly focus on pure cyberspace domains. The risks that adversarial examples pose to CPS applications have not been well investigated. In particular, due to the distributed property of data sources and the inherent physical constraints imposed by CPSs, the widely-used threat models and the state-of-the-art AML algorithms in previous cyberspace research become infeasible. We study the potential vulnerabilities of ML applied in CPSs by proposing Constrained Adversarial Machine Learning (ConAML), which generates adversarial examples that satisfy the intrinsic constraints of the physical systems. We first summarize the difference between AML in CPSs and AML in existing cyberspace systems and propose a general threat model for ConAML. We then design a best-effort search algorithm to iteratively generate adversarial examples with linear physical constraints. We evaluate our algorithms with simulations of two typical CPSs, the power grids and the water treatment system. The results show that our ConAML algorithms can effectively generate adversarial examples which significantly decrease the performance of the ML models even under practical constraints.
MMMay 15, 2019
SmartBullets: A Cloud-Assisted Bullet Screen Filter based on Deep LearningHaoran Niu, Jiangnan Li, Yu Zhao
Bullet-screen is a technique that enables website users to send real-time comments ("bullets") across the screen. Compared with traditional video reviews, bullet-screen provides new ways of expressing feelings while watching videos and more interactions between video viewers. However, since all the comments from the viewers are shown on the screen publicly and simultaneously, some low-quality bullets will reduce the watching enjoyment of the users. Although bullet-screen video websites provide filter functions based on regular expressions, bad bullets can still easily pass the filter after a small modification. In this paper, we present SmartBullets, a user-centered bullet-screen filter based on deep learning techniques. A convolutional neural network is trained as the classifier to determine whether a bullet needs to be removed according to its quality. Moreover, to increase the scalability of the filter, we employ a cloud-assisted framework by developing a backend cloud server and a front-end browser extension. An evaluation with 40 volunteers shows that SmartBullets can effectively remove low-quality bullets and improve the overall watching experience of viewers.
CRAug 2, 2018
A Practical Searchable Symmetric Encryption Scheme for Smart Grid DataJiangnan Li, Xiangyu Niu, Jinyuan Stella Sun
Outsourcing data storage to the remote cloud can be an economical solution to enhance data management in the smart grid ecosystem. To protect the privacy of data, the utility company may choose to encrypt the data before uploading them to the cloud. However, while encryption provides confidentiality to data, it also sacrifices the data owners' ability to query a specific segment of their data. Searchable symmetric encryption (SSE) is a technology that enables users to store documents in ciphertext form while keeping the functionality to search keywords in the documents. However, most state-of-the-art SSE algorithms focus only on general document storage, which may make them unsuitable for smart grid applications. In this paper, we propose a simple, practical SSE scheme that aims to protect the privacy of data generated in the smart grid. Our scheme achieves high space complexity with small information disclosure that is acceptable for practical smart grid applications. We also implement a prototype over the statistical data of advanced metering infrastructure to show the effectiveness of our approach.
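The general idea of searching keywords without revealing them to the cloud can be illustrated with a toy keyed index. This is not the scheme proposed in the paper, whose construction is not described in the abstract. The sketch assumes a simple design in which each keyword is replaced by an HMAC-SHA256 token under a secret key held by the data owner; the names `search_token`, `build_index`, and `search`, the demo key, and the meter identifiers are hypothetical. Document contents would be encrypted separately, and the server still learns which documents match a token, so this is for illustration only.

```python
import hashlib
import hmac

def search_token(key, keyword):
    """Derive a deterministic token for a keyword; only the key holder can do this."""
    return hmac.new(key, keyword.encode("utf-8"), hashlib.sha256).hexdigest()

def build_index(key, documents):
    """Map keyword tokens to document ids. The server stores only this index."""
    index = {}
    for doc_id, keywords in documents.items():
        for keyword in set(keywords):
            index.setdefault(search_token(key, keyword), []).append(doc_id)
    return index

def search(index, token):
    """Server-side lookup: match a token without ever seeing the keyword."""
    return sorted(index.get(token, []))

key = b"demo-key-not-for-production"
docs = {
    "meter-001": ["2018-08", "residential"],
    "meter-002": ["2018-08", "commercial"],
}
index = build_index(key, docs)
hits = search(index, search_token(key, "2018-08"))
```

A query for a keyword that was never indexed, or a token derived under a different key, simply returns an empty list.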