CVJun 3, 2023Code
Large, Complex, and Realistic Safety Clothing and Helmet Detection: Dataset and MethodFusheng Yu, Jiang Li, Xiaoping Wang et al.
Detecting safety clothing and helmets is paramount for ensuring the safety of construction workers. However, the development of deep learning models in this domain has been impeded by the scarcity of high-quality datasets. In this study, we construct a large, complex, and realistic safety clothing and helmet detection (SFCHD) dataset. SFCHD is derived from two authentic chemical plants, comprising 12,373 images, 7 categories, and 50,552 annotations. We partition the SFCHD dataset into training and testing sets with a ratio of 4:1 and validate its utility by applying several classic object detection algorithms. Furthermore, drawing inspiration from spatial and channel attention mechanisms, we design a spatial and channel attention-based low-light enhancement (SCALE) module. SCALE is a plug-and-play component with a high degree of flexibility. Extensive evaluations of the SCALE module on both the ExDark and SFCHD datasets have empirically demonstrated its efficacy in enhancing the performance of detectors under low-light conditions. The dataset and code are publicly available at https://github.com/lijfrank-open/SFCHD-SCALE.
CLJul 6, 2022
GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion RecognitionJiang Li, Xiaoping Wang, Guoqing Lv et al.
Emotion Recognition in Conversation (ERC) plays a significant part in Human-Computer Interaction (HCI) systems since it can provide empathetic services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches. Recently, Graph Neural Networks (GNNs) have been widely used in a variety of fields due to their superior performance in relation modeling. In multimodal ERC, GNNs are capable of extracting both long-distance contextual information and inter-modal interactive information. Unfortunately, since existing methods such as MMGCN directly fuse multiple modalities, redundant information may be generated and diverse information may be lost. In this work, we present a directed Graph based Cross-modal Feature Complementation (GraphCFC) module that can efficiently model contextual and interactive information. GraphCFC alleviates the problem of heterogeneity gap in multimodal fusion by utilizing multiple subspace extractors and Pair-wise Cross-modal Complementary (PairCC) strategy. We extract various types of edges from the constructed graph for encoding, thus enabling GNNs to extract crucial contextual and interactive information more accurately when performing message passing. Furthermore, we design a GNN structure called GAT-MLP, which can provide a new unified network framework for multimodal learning. The experimental results on two benchmark datasets show that our GraphCFC outperforms the state-of-the-art (SOTA) approaches.
CLJul 28, 2023
CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion RecognitionJiang Li, Xiaoping Wang, Yingjian Liu et al.
Multimodal emotion recognition in conversation (ERC) has garnered growing attention from research communities in various fields. In this paper, we propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for ERC. Extant approaches employ each modality equally without distinguishing the amount of emotional information in these modalities, rendering it hard to adequately extract complementary information from multimodal data. To cope with this problem, in CFN-ESA, we treat textual modality as the primary source of emotional information, while visual and acoustic modalities are taken as the secondary sources. Besides, most multimodal ERC models ignore emotion-shift information and overfocus on contextual information, leading to the failure of emotion recognition under emotion-shift scenario. We elaborate an emotion-shift module to address this challenge. CFN-ESA mainly consists of unimodal encoder (RUME), cross-modal encoder (ACME), and emotion-shift module (LESM). RUME is applied to extract conversation-level contextual emotional cues while pulling together data distributions between modalities; ACME is utilized to perform multimodal interaction centered on textual modality; LESM is used to model emotion shift and capture emotion-shift information, thereby guiding the learning of the main task. Experimental results demonstrate that CFN-ESA can effectively promote performance for ERC and remarkably outperform state-of-the-art models.
CLMar 20, 2023
EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversationYingjian Liu, Jiang Li, Xiaoping Wang et al.
Emotion Recognition in Conversation (ERC) has attracted growing attention in recent years as a result of the advancement and implementation of human-computer interface technologies. In this paper, we propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for ERC task. Our EmotionIC consists of three main components, i.e., Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF). Compared to previous ERC models, EmotionIC can model a conversation more thoroughly at both the feature-extraction and classification levels. The proposed model attempts to integrate the advantages of attention- and recurrence-based methods at the feature-extraction level. Specifically, IMMHA is applied to capture identity-based global contextual dependencies, while DiaGRU is utilized to extract speaker- and temporal-aware local contextual information. At the classification level, SkipCRF can explicitly mine complex emotional flows from higher-order neighboring utterances in the conversation. Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets. The ablation studies confirm that our modules can effectively model emotional inertia and contagion.
CLJul 31, 2024
Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion RecognitionJiang Li, Xiaoping Wang, Zhigang Zeng
Multimodal emotion recognition in conversation (MERC) has garnered substantial research attention recently. Existing MERC methods face several challenges: (1) they fail to fully harness direct inter-modal cues, possibly leading to less-than-thorough cross-modal modeling; (2) they concurrently extract information from the same and different modalities at each network layer, potentially triggering conflicts from the fusion of multi-source data; (3) they lack the agility required to detect dynamic sentimental changes, perhaps resulting in inaccurate classification of utterances with abrupt sentiment shifts. To address these issues, a novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, i.e., GSF and SDP modules. GSF ingeniously leverages graph structures to alternately assimilate inter-modal and intra-modal emotional dependencies layer by layer, adequately capturing cross-modal cues while effectively circumventing fusion conflicts. SDP is an auxiliary task to explicitly delineate the sentiment dynamics between utterances, promoting the model's ability to distinguish sentimental discrepancies. GraphSmile is effortlessly applied to multimodal sentiment analysis in conversation (MSAC), thus enabling simultaneous execution of MERC and MSAC tasks. Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns, significantly outperforming baseline models.
CLAug 12, 2023
ERNetCL: A novel emotion recognition network in textual conversation based on curriculum learning strategyJiang Li, Xiaoping Wang, Yingjian Liu et al.
Emotion recognition in conversation (ERC) has emerged as a research hotspot in domains such as conversational robots and question-answer systems. How to efficiently and adequately retrieve contextual emotional cues has been one of the key challenges in the ERC task. Existing efforts do not fully model the context and employ complex network structures, resulting in limited performance gains. In this paper, we propose a novel emotion recognition network based on curriculum learning strategy (ERNetCL). The proposed ERNetCL primarily consists of temporal encoder (TE), spatial encoder (SE), and curriculum learning (CL) loss. We utilize TE and SE to combine the strengths of previous methods in a simplistic manner to efficiently capture temporal and spatial contextual information in the conversation. To ease the harmful influence resulting from emotion shift and simulate the way humans learn curriculum from easy to hard, we apply the idea of CL to the ERC task to progressively optimize the network parameters. At the beginning of training, we assign lower learning weights to difficult samples. As the epoch increases, the learning weights for these samples are gradually raised. Extensive experiments on four datasets exhibit that our proposed method is effective and dramatically beats other baseline models.
CLJul 2, 2023
A Dual-Stream Recurrence-Attention Network With Global-Local Awareness for Emotion Recognition in Textual DialogJiang Li, Xiaoping Wang, Zhigang Zeng
In real-world dialog systems, the ability to understand the user's emotions and interact anthropomorphically is of great significance. Emotion Recognition in Conversation (ERC) is one of the key ways to accomplish this goal and has attracted growing attention. How to model the context in a conversation is a central aspect and a major challenge of ERC tasks. Most existing approaches struggle to adequately incorporate both global and local contextual information, and their network structures are overly sophisticated. For this reason, we propose a simple and effective Dual-stream Recurrence-Attention Network (DualRAN), which is based on Recurrent Neural Network (RNN) and Multi-head ATtention network (MAT). DualRAN eschews the complex components of current methods and focuses on combining recurrence-based methods with attention-based ones. DualRAN is a dual-stream structure mainly consisting of local- and global-aware modules, modeling a conversation simultaneously from distinct perspectives. In addition, we develop two single-stream network variants for DualRAN, i.e., SingleRANv1 and SingleRANv2. According to the experimental findings, DualRAN boosts the weighted F1 scores by 1.43% and 0.64% on the IEMOCAP and MELD datasets, respectively, in comparison to the strongest baseline. On two other datasets (i.e., EmoryNLP and DailyDialog), our method also attains competitive results.
CLSep 18, 2023
Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion DisentanglementShanglin Lei, Xiaoping Wang, Guanting Dong et al.
Emotion Recognition in Conversation (ERC) has attracted widespread attention in the natural language processing field due to its enormous potential for practical applications. Existing ERC methods face challenges in achieving generalization to diverse scenarios due to insufficient modeling of context, ambiguous capture of dialogue relationships and overfitting in speaker modeling. In this work, we present a Hybrid Continuous Attributive Network (HCAN) to address these issues in the perspective of emotional continuation and emotional attribution. Specifically, HCAN adopts a hybrid recurrent and attention-based module to model global emotion continuity. Then a novel Emotional Attribution Encoding (EAE) is proposed to model intra- and inter-emotional attribution for each utterance. Moreover, aiming to enhance the robustness of the model in speaker modeling and improve its performance in different scenarios, A comprehensive loss function emotional cognitive loss $\mathcal{L}_{\rm EC}$ is proposed to alleviate emotional drift and overcome the overfitting of the model to speaker modeling. Our model achieves state-of-the-art performance on three datasets, demonstrating the superiority of our work. Another extensive comparative experiments and ablation studies on three benchmarks are conducted to provided evidence to support the efficacy of each module. Further exploration of generalization ability experiments shows the plug-and-play nature of the EAE module in our method.
AIFeb 18
NeuDiff Agent: A Governed AI Workflow for Single-Crystal Neutron CrystallographyZhongcan Xiao, Leyi Zhang, Guannan Zhang et al.
Large-scale facilities increasingly face analysis and reporting latency as the limiting step in scientific throughput, particularly for structurally and magnetically complex samples that require iterative reduction, integration, refinement, and validation. To improve time-to-result and analysis efficiency, NeuDiff Agent is introduced as a governed, tool-using AI workflow for TOPAZ at the Spallation Neutron Source that takes instrument data products through reduction, integration, refinement, and validation to a validated crystal structure and a publication-ready CIF. NeuDiff Agent executes this established pipeline under explicit governance by restricting actions to allowlisted tools, enforcing fail-closed verification gates at key workflow boundaries, and capturing complete provenance for inspection, auditing, and controlled replay. Performance is assessed using a fixed prompt protocol and repeated end-to-end runs with two large language model backends, with user and machine time partitioned and intervention burden and recovery behaviors quantified under gating. In a reference-case benchmark, NeuDiff Agent reduces wall time from 435 minutes (manual) to 86.5(4.7) to 94.4(3.5) minutes (4.6-5.0x faster) while producing a validated CIF with no checkCIF level A or B alerts. These results establish a practical route to deploy agentic AI in facility crystallography while preserving traceability and publication-facing validation requirements.
CLDec 13, 2022
InferEM: Inferring the Speaker's Intention for Empathetic Dialogue GenerationGuoqing Lv, Jiang Li, Xiaoping Wang et al.
Current approaches to empathetic response generation typically encode the entire dialogue history directly and put the output into a decoder to generate friendly feedback. These methods focus on modelling contextual information but neglect capturing the direct intention of the speaker. We argue that the last utterance in the dialogue empirically conveys the intention of the speaker. Consequently, we propose a novel model named InferEM for empathetic response generation. We separately encode the last utterance and fuse it with the entire dialogue through the multi-head attention based intention fusion module to capture the speaker's intention. Besides, we utilize previous utterances to predict the last utterance, which simulates human's psychology to guess what the interlocutor may speak in advance. To balance the optimizing rates of the utterance prediction and response generation, a multi-task learning strategy is designed for InferEM. Experimental results demonstrate the plausibility and validity of InferEM in improving empathetic expression.
CLSep 21, 2023
InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language ModelsShanglin Lei, Guanting Dong, Xiaoping Wang et al.
The field of emotion recognition of conversation (ERC) has been focusing on separating sentence feature encoding and context modeling, lacking exploration in generative paradigms based on unified designs. In this study, we propose a novel approach, InstructERC, to reformulate the ERC task from a discriminative framework to a generative framework based on Large Language Models (LLMs). InstructERC makes three significant contributions: (1) it introduces a simple yet effective retrieval template module, which helps the model explicitly integrate multi-granularity dialogue supervision information. (2) We introduce two additional emotion alignment tasks, namely speaker identification and emotion prediction tasks, to implicitly model the dialogue role relationships and future emotional tendencies in conversations. (3) Pioneeringly, we unify emotion labels across benchmarks through the feeling wheel to fit real application scenarios. InstructERC still perform impressively on this unified dataset. Our LLM-based plugin framework significantly outperforms all previous models and achieves comprehensive SOTA on three commonly used ERC datasets. Extensive analysis of parameter-efficient and data-scaling experiments provides empirical guidance for applying it in practical scenarios.
LGMar 6, 2025Code
Joint Masked Reconstruction and Contrastive Learning for Mining Interactions Between ProteinsJiang Li, Xiaoping Wang
Protein-protein interaction (PPI) prediction is an instrumental means in elucidating the mechanisms underlying cellular operations, holding significant practical implications for the realms of pharmaceutical development and clinical treatment. Presently, the majority of research methods primarily concentrate on the analysis of amino acid sequences, while investigations predicated on protein structures remain in the nascent stages of exploration. Despite the emergence of several structure-based algorithms in recent years, these are still confronted with inherent challenges: (1) the extraction of intrinsic structural information of proteins typically necessitates the expenditure of substantial computational resources; (2) these models are overly reliant on seen protein data, struggling to effectively unearth interaction cues between unknown proteins. To further propel advancements in this domain, this paper introduces a novel PPI prediction method jointing masked reconstruction and contrastive learning, termed JmcPPI. This methodology dissects the PPI prediction task into two distinct phases: during the residue structure encoding phase, JmcPPI devises two feature reconstruction tasks and employs graph attention mechanism to capture structural information between residues; during the protein interaction inference phase, JmcPPI perturbs the original PPI graph and employs a multi-graph contrastive learning strategy to thoroughly mine extrinsic interaction information of novel proteins. Extensive experiments conducted on three widely utilized PPI datasets demonstrate that JmcPPI surpasses existing optimal baseline models across various data partition schemes. The associated code can be accessed via https://github.com/lijfrank-open/JmcPPI.
IRJan 10, 2022
Tree-based Search Graph for Approximate Nearest Neighbor SearchXiaobin Fan, Xiaoping Wang, Kai Lu et al.
Nearest neighbor search supports important applications in many domains, such as database, machine learning, computer vision. Since the computational cost for accurate search is too high, the community turned to the research of approximate nearest neighbor search (ANNS). Among them, graph-based algorithm is one of the most important branches. Research by Fu et al. shows that the algorithms based on Monotonic Search Network (MSNET), such as NSG and NSSG, have achieved the state-of-the-art search performance in efficiency. The MSNET is dedicated to achieving monotonic search with minimal out-degree of nodes to pursue high efficiency. However, the current MSNET designs did not optimize the probability of the monotonic search, and the lower bound of the probability is only 50%. If they fail in monotonic search stage, they have to suffer tremendous backtracking cost to achieve the required accuracy. This will cause performance problems in search efficiency. To address this problem, we propose (r,p)-MSNET, which achieves guaranteed probability on monotonic search. Due to the high building complexity of a strict (r,p)-MSNET, we propose TBSG, which is an approximation with low complexity. Experiment conducted on four million-scaled datasets show that TBSG outperforms existing state-of-the-art graph-based algorithms in search efficiency. Our code has been released on Github.
LGJan 10, 2022
Multimodal Representations Learning Based on Mutual Information Maximization and Minimization and Identity Embedding for Multimodal Sentiment AnalysisJiahao Zheng, Sen Zhang, Xiaoping Wang et al.
Multimodal sentiment analysis (MSA) is a fundamental complex research problem due to the heterogeneity gap between different modalities and the ambiguity of human emotional expression. Although there have been many successful attempts to construct multimodal representations for MSA, there are still two challenges to be addressed: 1) A more robust multimodal representation needs to be constructed to bridge the heterogeneity gap and cope with the complex multimodal interactions, and 2) the contextual dynamics must be modeled effectively throughout the information flow. In this work, we propose a multimodal representation model based on Mutual information Maximization and Minimization and Identity Embedding (MMMIE). We combine mutual information maximization between modal pairs, and mutual information minimization between input data and corresponding features to mine the modal-invariant and task-related information. Furthermore, Identity Embedding is proposed to prompt the downstream network to perceive the contextual information. Experimental results on two public datasets demonstrate the effectiveness of the proposed model.
CVSep 8, 2019
STA: Adversarial Attacks on Siamese TrackersXugang Wu, Xiaoping Wang, Xu Zhou et al.
Recently, the majority of visual trackers adopt Convolutional Neural Network (CNN) as their backbone to achieve high tracking accuracy. However, less attention has been paid to the potential adversarial threats brought by CNN, including Siamese network. In this paper, we first analyze the existing vulnerabilities in Siamese trackers and propose the requirements for a successful adversarial attack. On this basis, we formulate the adversarial generation problem and propose an end-to-end pipeline to generate a perturbed texture map for the 3D object that causes the trackers to fail. Finally, we conduct thorough experiments to verify the effectiveness of our algorithm. Experiment results show that adversarial examples generated by our algorithm can successfully lower the tracking accuracy of victim trackers and even make them drift off. To the best of our knowledge, this is the first work to generate 3D adversarial examples on visual trackers.
CVJan 31, 2018
Weighted Nonlocal Total Variation in Image ProcessingHaohan Li, Zuoqiang Shi, Xiaoping Wang
In this paper, a novel weighted nonlocal total variation (WNTV) method is proposed. Compared to the classical nonlocal total variation methods, our method modifies the energy functional to introduce a weight to balance between the labeled sets and unlabeled sets. With extensive numerical examples in semi-supervised clustering, image inpainting and image colorization, we demonstrate that WNTV provides an effective and efficient method in many image processing and machine learning problems.
CVAug 4, 2016
An efficient iterative thresholding method for image segmentationDong Wang, Haohan Li, Xiaoyu Wei et al.
We proposed an efficient iterative thresholding method for multi-phase image segmentation. The algorithm is based on minimizing piecewise constant Mumford-Shah functional in which the contour length (or perimeter) is approximated by a non-local multi-phase energy. The minimization problem is solved by an iterative method. Each iteration consists of computing simple convolutions followed by a thresholding step. The algorithm is easy to implement and has the optimal complexity $O(N \log N)$ per iteration. We also show that the iterative algorithm has the total energy decaying property. We present some numerical results to show the efficiency of our method.