Zhigang Zeng

LG
h-index22
29papers
621citations
Novelty51%
AI Score59

29 Papers

SYDec 29, 2022
Federated Multi-Agent Deep Reinforcement Learning Approach via Physics-Informed Reward for Multi-Microgrid Energy Management

Yuanzheng Li, Shangyang He, Yang Li et al.

The utilization of large-scale distributed renewable energy promotes the development of the multi-microgrid (MMG), which raises the need of developing an effective energy management method to minimize economic costs and keep self energy-sufficiency. The multi-agent deep reinforcement learning (MADRL) has been widely used for the energy management problem because of its real-time scheduling ability. However, its training requires massive energy operation data of microgrids (MGs), while gathering these data from different MGs would threaten their privacy and data security. Therefore, this paper tackles this practical yet challenging issue by proposing a federated multi-agent deep reinforcement learning (F-MADRL) algorithm via the physics-informed reward. In this algorithm, the federated learning (FL) mechanism is introduced to train the F-MADRL algorithm thus ensures the privacy and the security of data. In addition, a decentralized MMG model is built, and the energy of each participated MG is managed by an agent, which aims to minimize economic costs and keep self energy-sufficiency according to the physics-informed reward. At first, MGs individually execute the self-training based on local energy operation data to train their local agent models. Then, these local models are periodically uploaded to a server and their parameters are aggregated to build a global agent, which will be broadcasted to MGs and replace their local agents. In this way, the experience of each MG agent can be shared and the energy operation data is not explicitly transmitted, thus protecting the privacy and ensuring data security. Finally, experiments are conducted on Oak Ridge national laboratory distributed energy control communication lab microgrid (ORNL-MG) test system, and the comparisons are carried out to verify the effectiveness of introducing the FL mechanism and the outperformance of our proposed F-MADRL.

LGJun 18, 2023
IF2Net: Innately Forgetting-Free Networks for Continual Learning

Depeng Li, Tianqi Wang, Bingrong Xu et al.

Continual learning can incrementally absorb new concepts without interfering with previously learned knowledge. Motivated by the characteristics of neural networks, in which information is stored in weights on connections, we investigated how to design an Innately Forgetting-Free Network (IF2Net) for continual learning context. This study proposed a straightforward yet effective learning paradigm by ingeniously keeping the weights relative to each seen task untouched before and after learning a new task. We first presented the novel representation-level learning on task sequences with random weights. This technique refers to tweaking the drifted representations caused by randomization back to their separate task-optimal working states, but the involved weights are frozen and reused (opposite to well-known layer-wise updates of weights). Then, sequential decision-making without forgetting can be achieved by projecting the output weight updates into the parsimonious orthogonal space, making the adaptations not disturb old knowledge while maintaining model plasticity. IF2Net allows a single network to inherently learn unlimited mapping rules without telling task identities at test time by integrating the respective strengths of randomization and orthogonalization. We validated the effectiveness of our approach in the extensive theoretical analysis and empirical study.

CVJun 3, 2023Code
Large, Complex, and Realistic Safety Clothing and Helmet Detection: Dataset and Method

Fusheng Yu, Jiang Li, Xiaoping Wang et al.

Detecting safety clothing and helmets is paramount for ensuring the safety of construction workers. However, the development of deep learning models in this domain has been impeded by the scarcity of high-quality datasets. In this study, we construct a large, complex, and realistic safety clothing and helmet detection (SFCHD) dataset. SFCHD is derived from two authentic chemical plants, comprising 12,373 images, 7 categories, and 50,552 annotations. We partition the SFCHD dataset into training and testing sets with a ratio of 4:1 and validate its utility by applying several classic object detection algorithms. Furthermore, drawing inspiration from spatial and channel attention mechanisms, we design a spatial and channel attention-based low-light enhancement (SCALE) module. SCALE is a plug-and-play component with a high degree of flexibility. Extensive evaluations of the SCALE module on both the ExDark and SFCHD datasets have empirically demonstrated its efficacy in enhancing the performance of detectors under low-light conditions. The dataset and code are publicly available at https://github.com/lijfrank-open/SFCHD-SCALE.

SDMay 18Code
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Kaiwen Luo, Zhenhong Zhou, Leo Wang et al.

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

CLJul 6, 2022
GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition

Jiang Li, Xiaoping Wang, Guoqing Lv et al.

Emotion Recognition in Conversation (ERC) plays a significant part in Human-Computer Interaction (HCI) systems since it can provide empathetic services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches. Recently, Graph Neural Networks (GNNs) have been widely used in a variety of fields due to their superior performance in relation modeling. In multimodal ERC, GNNs are capable of extracting both long-distance contextual information and inter-modal interactive information. Unfortunately, since existing methods such as MMGCN directly fuse multiple modalities, redundant information may be generated and diverse information may be lost. In this work, we present a directed Graph based Cross-modal Feature Complementation (GraphCFC) module that can efficiently model contextual and interactive information. GraphCFC alleviates the problem of heterogeneity gap in multimodal fusion by utilizing multiple subspace extractors and Pair-wise Cross-modal Complementary (PairCC) strategy. We extract various types of edges from the constructed graph for encoding, thus enabling GNNs to extract crucial contextual and interactive information more accurately when performing message passing. Furthermore, we design a GNN structure called GAT-MLP, which can provide a new unified network framework for multimodal learning. The experimental results on two benchmark datasets show that our GraphCFC outperforms the state-of-the-art (SOTA) approaches.

LGJun 16, 2023
Multi-View Class Incremental Learning

Depeng Li, Tianqi Wang, Junwei Chen et al.

Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance. To make MVL methods more practical in an open-ended environment, this paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views, requiring no access to earlier views of data. However, MVCIL is challenged by the catastrophic forgetting of old information and the interference with learning new concepts. To address this, we first develop a randomization-based representation learning technique serving for feature extraction to guarantee their separate view-optimal working states, during which multiple views belonging to a class are presented sequentially; Then, we integrate them one by one in the orthogonality fusion subspace spanned by the extracted features; Finally, we introduce selective weight consolidation for learning-without-forgetting decision-making while encountering new classes. Extensive experiments on synthetic and real-world datasets validate the effectiveness of our approach.

CLJul 28, 2023
CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition

Jiang Li, Xiaoping Wang, Yingjian Liu et al.

Multimodal emotion recognition in conversation (ERC) has garnered growing attention from research communities in various fields. In this paper, we propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for ERC. Extant approaches employ each modality equally without distinguishing the amount of emotional information in these modalities, rendering it hard to adequately extract complementary information from multimodal data. To cope with this problem, in CFN-ESA, we treat textual modality as the primary source of emotional information, while visual and acoustic modalities are taken as the secondary sources. Besides, most multimodal ERC models ignore emotion-shift information and overfocus on contextual information, leading to the failure of emotion recognition under emotion-shift scenario. We elaborate an emotion-shift module to address this challenge. CFN-ESA mainly consists of unimodal encoder (RUME), cross-modal encoder (ACME), and emotion-shift module (LESM). RUME is applied to extract conversation-level contextual emotional cues while pulling together data distributions between modalities; ACME is utilized to perform multimodal interaction centered on textual modality; LESM is used to model emotion shift and capture emotion-shift information, thereby guiding the learning of the main task. Experimental results demonstrate that CFN-ESA can effectively promote performance for ERC and remarkably outperform state-of-the-art models.

CLMar 20, 2023
EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation

Yingjian Liu, Jiang Li, Xiaoping Wang et al.

Emotion Recognition in Conversation (ERC) has attracted growing attention in recent years as a result of the advancement and implementation of human-computer interface technologies. In this paper, we propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for ERC task. Our EmotionIC consists of three main components, i.e., Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF). Compared to previous ERC models, EmotionIC can model a conversation more thoroughly at both the feature-extraction and classification levels. The proposed model attempts to integrate the advantages of attention- and recurrence-based methods at the feature-extraction level. Specifically, IMMHA is applied to capture identity-based global contextual dependencies, while DiaGRU is utilized to extract speaker- and temporal-aware local contextual information. At the classification level, SkipCRF can explicitly mine complex emotional flows from higher-order neighboring utterances in the conversation. Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets. The ablation studies confirm that our modules can effectively model emotional inertia and contagion.

CLJul 31, 2024
Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition

Jiang Li, Xiaoping Wang, Zhigang Zeng

Multimodal emotion recognition in conversation (MERC) has garnered substantial research attention recently. Existing MERC methods face several challenges: (1) they fail to fully harness direct inter-modal cues, possibly leading to less-than-thorough cross-modal modeling; (2) they concurrently extract information from the same and different modalities at each network layer, potentially triggering conflicts from the fusion of multi-source data; (3) they lack the agility required to detect dynamic sentimental changes, perhaps resulting in inaccurate classification of utterances with abrupt sentiment shifts. To address these issues, a novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, i.e., GSF and SDP modules. GSF ingeniously leverages graph structures to alternately assimilate inter-modal and intra-modal emotional dependencies layer by layer, adequately capturing cross-modal cues while effectively circumventing fusion conflicts. SDP is an auxiliary task to explicitly delineate the sentiment dynamics between utterances, promoting the model's ability to distinguish sentimental discrepancies. GraphSmile is effortlessly applied to multimodal sentiment analysis in conversation (MSAC), thus enabling simultaneous execution of MERC and MSAC tasks. Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns, significantly outperforming baseline models.

LGMar 18Code
SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

Haolong Hu, Hanyu Li, Tiancheng He et al.

MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment.To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late-turn failures to earlier turns.I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2~10 turns.II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 -> 81.84/70.77 for 3B; 56.21/60.32 -> 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 -> 55.58/70.27 for 3B; 24.66/46.48 -> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone.Codes are available at https://github.com/Ed-Bg/SaFeR-Steer

CLAug 12, 2023
ERNetCL: A novel emotion recognition network in textual conversation based on curriculum learning strategy

Jiang Li, Xiaoping Wang, Yingjian Liu et al.

Emotion recognition in conversation (ERC) has emerged as a research hotspot in domains such as conversational robots and question-answer systems. How to efficiently and adequately retrieve contextual emotional cues has been one of the key challenges in the ERC task. Existing efforts do not fully model the context and employ complex network structures, resulting in limited performance gains. In this paper, we propose a novel emotion recognition network based on curriculum learning strategy (ERNetCL). The proposed ERNetCL primarily consists of temporal encoder (TE), spatial encoder (SE), and curriculum learning (CL) loss. We utilize TE and SE to combine the strengths of previous methods in a simplistic manner to efficiently capture temporal and spatial contextual information in the conversation. To ease the harmful influence resulting from emotion shift and simulate the way humans learn curriculum from easy to hard, we apply the idea of CL to the ERC task to progressively optimize the network parameters. At the beginning of training, we assign lower learning weights to difficult samples. As the epoch increases, the learning weights for these samples are gradually raised. Extensive experiments on four datasets exhibit that our proposed method is effective and dramatically beats other baseline models.

CLJul 2, 2023
A Dual-Stream Recurrence-Attention Network With Global-Local Awareness for Emotion Recognition in Textual Dialog

Jiang Li, Xiaoping Wang, Zhigang Zeng

In real-world dialog systems, the ability to understand the user's emotions and interact anthropomorphically is of great significance. Emotion Recognition in Conversation (ERC) is one of the key ways to accomplish this goal and has attracted growing attention. How to model the context in a conversation is a central aspect and a major challenge of ERC tasks. Most existing approaches struggle to adequately incorporate both global and local contextual information, and their network structures are overly sophisticated. For this reason, we propose a simple and effective Dual-stream Recurrence-Attention Network (DualRAN), which is based on Recurrent Neural Network (RNN) and Multi-head ATtention network (MAT). DualRAN eschews the complex components of current methods and focuses on combining recurrence-based methods with attention-based ones. DualRAN is a dual-stream structure mainly consisting of local- and global-aware modules, modeling a conversation simultaneously from distinct perspectives. In addition, we develop two single-stream network variants for DualRAN, i.e., SingleRANv1 and SingleRANv2. According to the experimental findings, DualRAN boosts the weighted F1 scores by 1.43% and 0.64% on the IEMOCAP and MELD datasets, respectively, in comparison to the strongest baseline. On two other datasets (i.e., EmoryNLP and DailyDialog), our method also attains competitive results.

LGMar 3Code
SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi et al.

Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Codes are available at https://github.com/Duebassx/SaFeR_ToolKit.

CLDec 13, 2022
InferEM: Inferring the Speaker's Intention for Empathetic Dialogue Generation

Guoqing Lv, Jiang Li, Xiaoping Wang et al.

Current approaches to empathetic response generation typically encode the entire dialogue history directly and put the output into a decoder to generate friendly feedback. These methods focus on modelling contextual information but neglect capturing the direct intention of the speaker. We argue that the last utterance in the dialogue empirically conveys the intention of the speaker. Consequently, we propose a novel model named InferEM for empathetic response generation. We separately encode the last utterance and fuse it with the entire dialogue through the multi-head attention based intention fusion module to capture the speaker's intention. Besides, we utilize previous utterances to predict the last utterance, which simulates human's psychology to guess what the interlocutor may speak in advance. To balance the optimizing rates of the utterance prediction and response generation, a multi-task learning strategy is designed for InferEM. Experimental results demonstrate the plausibility and validity of InferEM in improving empathetic expression.

LGJun 21, 2023
Complementary Learning Subnetworks for Parameter-Efficient Class-Incremental Learning

Depeng Li, Zhigang Zeng

In the scenario of class-incremental learning (CIL), deep neural networks have to adapt their model parameters to non-stationary data distributions, e.g., the emergence of new classes over time. However, CIL models are challenged by the well-known catastrophic forgetting phenomenon. Typical methods such as rehearsal-based ones rely on storing exemplars of old classes to mitigate catastrophic forgetting, which limits real-world applications considering memory resources and privacy issues. In this paper, we propose a novel rehearsal-free CIL approach that learns continually via the synergy between two Complementary Learning Subnetworks. Our approach involves jointly optimizing a plastic CNN feature extractor and an analytical feed-forward classifier. The inaccessibility of historical data is tackled by holistically controlling the parameters of a well-trained model, ensuring that the decision boundary learned fits new classes while retaining recognition of previously learned classes. Specifically, the trainable CNN feature extractor provides task-dependent knowledge separately without interference; and the final classifier integrates task-specific knowledge incrementally for decision-making without forgetting. In each CIL session, it accommodates new tasks by attaching a tiny set of declarative parameters to its backbone, in which only one matrix per task or one vector per class is kept for knowledge retention. Extensive experiments on a variety of task sequences show that our method achieves competitive results against state-of-the-art methods, especially in accuracy gain, memory cost, training efficiency, and task-order robustness. Furthermore, to make the non-growing backbone (i.e., a model with limited network capacity) suffice to train on more incoming tasks, a graceful forgetting implementation on previously learned trivial tasks is empirically investigated.

AIMay 12
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

ShiYing Huang, Liang Lin, Yuer Li et al.

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://anonymous.4open.science/r/MORA-MPA.

LGSep 18, 2025Code
HDC-X: Efficient Medical Data Classification for Embedded Devices

Jianglan Wei, Zhenyu Zhang, Pengcheng Wang et al.

Energy-efficient medical data classification is essential for modern disease screening, particularly in home and field healthcare where embedded devices are prevalent. While deep learning models achieve state-of-the-art accuracy, their substantial energy consumption and reliance on GPUs limit deployment on such platforms. We present HDC-X, a lightweight classification framework designed for low-power devices. HDC-X encodes data into high-dimensional hypervectors, aggregates them into multiple cluster-specific prototypes, and performs classification through similarity search in hyperspace. We evaluate HDC-X across three medical classification tasks; on heart sound classification, HDC-X is $350\times$ more energy-efficient than Bayesian ResNet with less than 1% accuracy difference. Moreover, HDC-X demonstrates exceptional robustness to noise, limited training data, and hardware error, supported by both theoretical analysis and empirical results, highlighting its potential for reliable deployment in real-world settings. Code is available at https://github.com/jianglanwei/HDC-X.

IVNov 23, 2024Code
Multi-scale Cascaded Foundation Model for Whole-body Organs-at-risk Segmentation

Rui Hao, Dayu Tan, Qiankun Li et al.

Accurate segmentation of organs-at-risk (OARs) is vital for safe and precise radiotherapy and surgery. Most existing studies segment only a limited set of organs or regions, lacking a systematic treatment of OARs segmentation. We present a Multi-scale Cascaded Fusion Network (MCFNet) that aggregates features across multiple scales and resolutions. MCFNet consists of a Sharp Extraction Backbone for the downsampling path and a Flexible Connection Backbone for skip-connection fusion, strengthening representation learning in both stages. This design improves boundary localization and preserves fine structures while maintaining computational efficiency, enabling reliable performance even on low-resolution inputs. Experiments on an NVIDIA A6000 GPU using 36,131 image-mask pairs from 671 patients across 10 datasets show consistent robustness and strong cross-dataset generalization. An adaptive loss-aggregation strategy further stabilizes optimization and yields additional gains in accuracy and training efficiency. Through extensive validation, MCFNet outperforms existing methods, excelling in organ segmentation and providing reliable image-guided support for computer-aided diagnosis. Our solution aims to improve the precision and safety of radiotherapy and surgery while supporting personalized treatment, advancing modern medical technology. The code has been made available on GitHub: https://github.com/Henry991115/MCFNet.

CLJan 2
CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu et al.

Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create the safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that the Chinese-specific adversarial pattern is a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.

AIMar 10
PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs

Jinyue Li, Yuci Liang, Qiankun Li et al.

Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.

AIFeb 16
Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning

Jing Tang, Kun Wang, Haolang Lu et al.

Multimodal large language models (MLLMs) in long chain-of-thought reasoning often fail when different knowledge sources provide conflicting signals. We formalize these failures under a unified notion of knowledge conflict, distinguishing input-level objective conflict from process-level effective conflict. Through probing internal representations, we reveal that: (I) Linear Separability: different conflict types are explicitly encoded as linearly separable features rather than entangled; (II) Depth Localization: conflict signals concentrate in mid-to-late layers, indicating a distinct processing stage for conflict encoding; (III) Hierarchical Consistency: aggregating noisy token-level signals along trajectories robustly recovers input-level conflict types; and (IV) Directional Asymmetry: reinforcing the model's implicit source preference under conflict is far easier than enforcing the opposite source. Our findings provide a mechanism-level view of multimodal reasoning under knowledge conflict and enable principled diagnosis and control of long-CoT failures.

AIMar 8, 2025
Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization

Wei Liu, Zhiying Deng, Zhongyu Niu et al.

Extracting a small subset of crucial rationales from the full input is a key problem in explainability research. The most widely used fundamental criterion for rationale extraction is the maximum mutual information (MMI) criterion. In this paper, we first demonstrate that MMI suffers from diminishing marginal returns. Once part of the rationale has been identified, finding the remaining portions contributes only marginally to increasing the mutual information, making it difficult to use MMI to locate the rest. In contrast to MMI that aims to reproduce the prediction, we seek to identify the parts of the input that the network can actually utilize. This is achieved by comparing how different rationale candidates match the capability space of the weight matrix. The weight matrix of a neural network is typically low-rank, meaning that the linear combinations of its column vectors can only cover part of the directions in a high-dimensional space (high-dimension: the dimensions of an input vector). If an input is fully utilized by the network, {it generally matches these directions (e.g., a portion of a hypersphere), resulting in a representation with a high norm. Conversely, if an input primarily falls outside (orthogonal to) these directions}, its representation norm will approach zero, behaving like noise that the network cannot effectively utilize. Building on this, we propose using the norms of rationale candidates as an alternative objective to MMI. Through experiments on four text classification datasets and one graph classification dataset using three network architectures (GRUs, BERT, and GCN), we show that our method outperforms MMI and its improved variants in identifying better rationales. We also compare our method with a representative LLM (llama-3.1-8b-instruct) and find that our simple method gets comparable results to it and can sometimes even outperform it.

LGJan 17, 2024
Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding

Depeng Li, Tianqi Wang, Junwei Chen et al.

Deep neural networks are susceptible to catastrophic forgetting when trained on sequential tasks. Various continual learning (CL) methods often rely on exemplar buffers or/and network expansion for balancing model stability and plasticity, which, however, compromises their practical value due to privacy and memory concerns. Instead, this paper considers a strict yet realistic setting, where the training data from previous tasks is unavailable and the model size remains relatively constant during sequential training. To achieve such desiderata, we propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion. This is achieved by the synergy between two key components: HSIC-Bottleneck Orthogonalization (HBO) implements non-overwritten parameter updates mediated by Hilbert-Schmidt independence criterion in an orthogonal space and EquiAngular Embedding (EAE) enhances decision boundary adaptation between old and new tasks with predefined basis vectors. Extensive experiments demonstrate that our method achieves competitive accuracy performance, even with absolute superiority of zero exemplar buffer and 1.02x the base model.

CLMar 7
Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment

Junming Liu, Yuqi Li, Shiping Wen et al.

Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes critical evidence to be submerged by voluminous noise, which complicates the discernment of relevant fragments within a dense input. In this paper, we propose \textbf{Hit-RAG}, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline. Our approach systematically refines the utilization of external evidence via three distinct stages. First, Supervised Fine-tuning establishes baseline context awareness to minimize information neglect. Next, Discriminative Preference Alignment enhances robustness against misleading distractors. Finally, Group-Relative Policy Optimization stabilizes logical synthesis to prevent reasoning collapse. Extensive evaluations on eight benchmarks demonstrate that Hit-RAG consistently yields substantial performance gains, enabling models to bridge the gap between context acquisition and accurate reasoning while surpassing much larger counterparts in long-context scenarios.

LGJun 4, 2024
Harnessing Neural Unit Dynamics for Effective and Scalable Class-Incremental Learning

Depeng Li, Tianqi Wang, Junwei Chen et al.

Class-incremental learning (CIL) aims to train a model to learn new classes from non-stationary data streams without forgetting old ones. In this paper, we propose a new kind of connectionist model by tailoring neural unit dynamics that adapt the behavior of neural networks for CIL. In each training session, it introduces a supervisory mechanism to guide network expansion whose growth size is compactly commensurate with the intrinsic complexity of a newly arriving task. This constructs a near-minimal network while allowing the model to expand its capacity when cannot sufficiently hold new classes. At inference time, it automatically reactivates the required neural units to retrieve knowledge and leaves the remaining inactivated to prevent interference. We name our model AutoActivator, which is effective and scalable. To gain insights into the neural unit dynamics, we theoretically analyze the model's convergence property via a universal approximation theorem on learning sequential mappings, which is under-explored in the CIL community. Experiments show that our method achieves strong CIL performance in rehearsal-free and minimal-expansion settings with different backbones.

LGJan 10, 2022
Multimodal Representations Learning Based on Mutual Information Maximization and Minimization and Identity Embedding for Multimodal Sentiment Analysis

Jiahao Zheng, Sen Zhang, Xiaoping Wang et al.

Multimodal sentiment analysis (MSA) is a fundamental complex research problem due to the heterogeneity gap between different modalities and the ambiguity of human emotional expression. Although there have been many successful attempts to construct multimodal representations for MSA, there are still two challenges to be addressed: 1) A more robust multimodal representation needs to be constructed to bridge the heterogeneity gap and cope with the complex multimodal interactions, and 2) the contextual dynamics must be modeled effectively throughout the information flow. In this work, we propose a multimodal representation model based on Mutual information Maximization and Minimization and Identity Embedding (MMMIE). We combine mutual information maximization between modal pairs, and mutual information minimization between input data and corresponding features to mine the modal-invariant and task-related information. Furthermore, Identity Embedding is proposed to prompt the downstream network to perceive the contextual information. Experimental results on two public datasets demonstrate the effectiveness of the proposed model.

CROct 30, 2020
EEG-Based Brain-Computer Interfaces Are Vulnerable to Backdoor Attacks

Lubin Meng, Jian Huang, Zhigang Zeng et al.

Research and development of electroencephalogram (EEG) based brain-computer interfaces (BCIs) have advanced rapidly, partly due to deeper understanding of the brain and wide adoption of sophisticated machine learning approaches for decoding the EEG signals. However, recent studies have shown that machine learning algorithms are vulnerable to adversarial attacks. This article proposes to use narrow period pulse for poisoning attack of EEG-based BCIs, which is implementable in practice and has never been considered before. One can create dangerous backdoors in the machine learning model by injecting poisoning samples into the training set. Test samples with the backdoor key will then be classified into the target class specified by the attacker. What most distinguishes our approach from previous ones is that the backdoor key does not need to be synchronized with the EEG trials, making it very easy to implement. The effectiveness and robustness of the backdoor attack approach is demonstrated, highlighting a critical security concern for EEG-based BCIs and calling for urgent attention to address it.

SPJul 3, 2020
Transfer Learning for Motor Imagery Based Brain-Computer Interfaces: A Complete Pipeline

Dongrui Wu, Xue Jiang, Ruimin Peng et al.

Transfer learning (TL) has been widely used in motor imagery (MI) based brain-computer interfaces (BCIs) to reduce the calibration effort for a new subject, and demonstrated promising performance. While a closed-loop MI-based BCI system, after electroencephalogram (EEG) signal acquisition and temporal filtering, includes spatial filtering, feature engineering, and classification blocks before sending out the control signal to an external device, previous approaches only considered TL in one or two such components. This paper proposes that TL could be considered in all three components (spatial filtering, feature engineering, and classification) of MI-based BCIs. Furthermore, it is also very important to specifically add a data alignment component before spatial filtering to make the data from different subjects more consistent, and hence to facilitate subsequential TL. Offline calibration experiments on two MI datasets verified our proposal. Especially, integrating data alignment and sophisticated TL approaches can significantly improve the classification performance, and hence greatly reduces the calibration effort.

LGMar 25, 2019
On the Functional Equivalence of TSK Fuzzy Systems to Neural Networks, Mixture of Experts, CART, and Stacking Ensemble Regression

Dongrui Wu, Chin-Teng Lin, Jian Huang et al.

Fuzzy systems have achieved great success in numerous applications. However, there are still many challenges in designing an optimal fuzzy system, e.g., how to efficiently optimize its parameters, how to balance the trade-off between cooperations and competitions among the rules, how to overcome the curse of dimensionality, how to increase its generalization ability, etc. Literature has shown that by making appropriate connections between fuzzy systems and other machine learning approaches, good practices from other domains may be used to improve the fuzzy systems, and vice versa. This paper gives an overview on the functional equivalence between Takagi-Sugeno-Kang fuzzy systems and four classic machine learning approaches -- neural networks, mixture of experts, classification and regression trees, and stacking ensemble regression -- for regression problems. We also point out some promising new research directions, inspired by the functional equivalence, that could lead to solutions to the aforementioned problems. To our knowledge, this is so far the most comprehensive overview on the connections between fuzzy systems and other popular machine learning approaches, and hopefully will stimulate more hybridization between different machine learning algorithms.