Ming Zhao

CV
h-index: 24
Papers: 49
Citations: 2,670
Novelty: 51%
AI Score: 58

49 Papers

RO · May 27, 2022 · Code
OpenCalib: A Multi-sensor Calibration Toolbox for Autonomous Driving

Guohang Yan, Liu Zhuochun, Chengjie Wang et al. · stanford

Accurate sensor calibration is a prerequisite for multi-sensor perception and localization systems for autonomous vehicles. The intrinsic parameter calibration of the sensor is to obtain the mapping relationship inside the sensor, and the extrinsic parameter calibration is to transform two or more sensors into a unified spatial coordinate system. Most sensors need to be calibrated after installation to ensure the accuracy of sensor measurements. To this end, we present OpenCalib, a calibration toolbox that contains a rich set of various sensor calibration methods. OpenCalib covers manual calibration tools, automatic calibration tools, factory calibration tools, and online calibration tools for different application scenarios. At the same time, to evaluate the calibration accuracy and subsequently improve the accuracy of the calibration algorithm, we released a corresponding benchmark dataset. This paper introduces various features and calibration methods of this toolbox. To our knowledge, this is the first open-sourced calibration codebase containing the full set of autonomous-driving-related calibration approaches in this area. We wish that the toolbox could be helpful to autonomous driving researchers. We have open-sourced our code on GitHub to benefit the community. Code is available at https://github.com/PJLab-ADG/SensorsCalibration.
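
To make the role of extrinsic parameters concrete, here is a minimal sketch of moving a point from one sensor frame into another with a 4x4 homogeneous transform. It is not taken from OpenCalib: the function names `make_extrinsic` and `transform_point`, the yaw-only rotation, and the numeric values are all assumptions chosen for illustration.

```python
import math

def make_extrinsic(yaw_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z by yaw, then translation.

    Hypothetical helper; a real extrinsic would carry a full 3-D rotation.
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform_point(extrinsic, point):
    """Map a 3-D point from the source sensor frame into the target frame."""
    x, y, z = point
    homogeneous = [x, y, z, 1.0]
    out = [sum(row[i] * homogeneous[i] for i in range(4)) for row in extrinsic]
    return out[:3]

# Assumed example: the source sensor is rotated 90 degrees about z and
# offset by (1.0, 0.0, 0.5) metres relative to the target sensor.
source_to_target = make_extrinsic(math.pi / 2, (1.0, 0.0, 0.5))
point_in_target = transform_point(source_to_target, (2.0, 0.0, 0.0))
print([round(v, 6) for v in point_in_target])
```

Once every sensor has such a transform into a shared reference frame, their measurements can be expressed in the unified coordinate system the abstract describes.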

53.7 · AI · May 30
PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

Zongzong Wu, Ming Zhao, Fengxiao Tang et al.

Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic alerts at the tail end of propagation chains, where distinct root-cause faults may produce highly similar end-point symptoms. Existing approaches, whether rule-based, machine learning (ML)-based, or large language model (LLM)-based, fundamentally map the alert set to a diagnosis in a single pass and are structurally incapable of resolving this end-point ambiguity. This paper proposes PropLLM, which is the first to integrate the hop-by-hop scene reconstruction paradigm with the generative reasoning capabilities of LLMs. Starting from end-point alerts, PropLLM traces back hop-by-hop along the propagation path, retrieving verifiable factual evidence from a dual-layer knowledge graph (KG) at each hop, while the proposed Temporal Causal Propagation Attention (TCPA) mechanism encodes known topological causal priors directly into the attention computation to guide the model along the correct causal direction, ultimately localizing the root cause and determining the fault type through a fully evidenced causal chain. On a real-world Wi-Fi multimodal fault dataset, PropLLM improves fault type diagnosis accuracy by 3.9% and root cause localization accuracy by 4.7% over the strongest baseline, while reducing the hallucination rate by 50.8%. Supplementary experiments on the TeleLogs 5G dataset further demonstrate the effectiveness of the proposed method across different network scenarios.

NI · Dec 5, 2022
Differentiated Federated Reinforcement Learning Based Traffic Offloading on Space-Air-Ground Integrated Networks

Yeguang Qin, Yilin Yang, Fengxiao Tang et al. · mila

The Space-Air-Ground Integrated Network (SAGIN) plays a pivotal role as a comprehensive foundational network communication infrastructure, presenting opportunities for highly efficient global data transmission. Nonetheless, given SAGIN's unique characteristics as a dynamically heterogeneous network, conventional network optimization methodologies encounter challenges in satisfying the stringent requirements for network latency and stability inherent to data transmission within this network environment. Therefore, this paper proposes the use of differentiated federated reinforcement learning (DFRL) to solve the traffic offloading problem in SAGIN, i.e., using multiple agents to generate differentiated traffic offloading policies. Considering the differentiated characteristics of each region of SAGIN, DFRL models the traffic offloading policy optimization process as the process of solving the Decentralized Partially Observable Markov Decision Process (DEC-POMDP) problem. The paper proposes a novel Differentiated Federated Soft Actor-Critic (DFSAC) algorithm to solve the problem. The DFSAC algorithm takes the network packet delay as the joint reward value and introduces the global trend model as the joint target action-value function of each agent to guide the update of each agent's policy. The simulation results demonstrate that the traffic offloading policy based on the DFSAC algorithm achieves better performance in terms of network throughput, packet loss rate, and packet delay compared to the traditional federated reinforcement learning approach and other baseline approaches.

CL · Mar 25, 2022
EmoCaps: Emotion Capsule based Model for Conversational Emotion Recognition

Zaijing Li, Fengxiao Tang, Ming Zhao et al.

Emotion recognition in conversation (ERC) aims to analyze the speaker's state and identify their emotion in the conversation. Recent works in ERC focus on context modeling but ignore the representation of contextual emotional tendency. To extract multi-modal information and the emotional tendency of the utterance effectively, we propose a new structure named Emoformer, which extracts multi-modal emotion vectors from different modalities and fuses them with the sentence vector to form an emotion capsule. Furthermore, we design an end-to-end ERC model called EmoCaps, which extracts emotion vectors through the Emoformer structure and obtains the emotion classification results from a context analysis model. In experiments on two benchmark datasets, our model outperforms the existing state-of-the-art models.

LG · Mar 14, 2023 · Code
Automatic Attention Pruning: Improving and Automating Model Pruning using Attentions

Kaiqi Zhao, Animesh Jain, Ming Zhao

Pruning is a promising approach to compress deep learning models in order to deploy them on resource-constrained edge devices. However, many existing pruning solutions are based on unstructured pruning, which yields models that cannot efficiently run on commodity hardware; and they often require users to manually explore and tune the pruning process, which is time-consuming and often leads to sub-optimal results. To address these limitations, this paper presents Automatic Attention Pruning (AAP), an adaptive, attention-based, structured pruning approach to automatically generate small, accurate, and hardware-efficient models that meet user objectives. First, it proposes iterative structured pruning using activation-based attention maps to effectively identify and prune unimportant filters. Then, it proposes adaptive pruning policies for automatically meeting the pruning objectives of accuracy-critical, memory-constrained, and latency-sensitive tasks. A comprehensive evaluation shows that AAP substantially outperforms the state-of-the-art structured pruning works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/Automatic-Attention-Pruning.git.
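
As a rough illustration of ranking filters by activation-based attention, the sketch below scores each filter by the mean squared value of its output feature map and selects the lowest-scoring filters for removal. The scoring rule, the function names `filter_attention` and `filters_to_prune`, and the toy feature maps are assumptions made for this example; the abstract does not specify how AAP computes its attention maps or its pruning policies.

```python
def filter_attention(feature_map):
    """Mean squared activation of one filter's 2-D output (assumed attention score)."""
    values = [v for row in feature_map for v in row]
    return sum(v * v for v in values) / len(values)

def filters_to_prune(feature_maps, prune_ratio):
    """Return sorted indices of the lowest-attention filters.

    feature_maps holds one 2-D activation map per filter; prune_ratio is the
    fraction of filters to remove in this round (hypothetical parameter).
    """
    scores = [filter_attention(fm) for fm in feature_maps]
    n_prune = int(len(scores) * prune_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_prune])

# Toy activations for four filters of one convolutional layer.
feature_maps = [
    [[0.9, 1.2], [0.8, 1.1]],    # strongly activated
    [[0.0, 0.1], [0.0, 0.0]],    # nearly silent
    [[0.4, 0.3], [0.5, 0.2]],
    [[0.05, 0.0], [0.1, 0.0]],   # nearly silent
]
print(filters_to_prune(feature_maps, 0.5))  # [1, 3]
```

Because whole filters are removed rather than individual weights, the remaining layer stays dense, which is what makes structured pruning friendly to commodity hardware.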

IV · Apr 2, 2022
Automatic Registration of Images with Inconsistent Content Through Line-Support Region Segmentation and Geometrical Outlier Removal

Ming Zhao, Yongpeng Wu, Shengda Pan et al.

The implementation of automatic image registration is still difficult in various applications. In this paper, an automatic image registration approach through line-support region segmentation and geometrical outlier removal (ALRS-GOR) is proposed. This new approach is designed to address the problems associated with the registration of images with affine deformations and inconsistent content, such as remote sensing images with different spectral content or noise interference, or map images with inconsistent annotations. To begin with, line-support regions, namely straight regions whose points share roughly the same image gradient angle, are extracted to address the issues of inconsistent content existing in images. To alleviate the incompleteness of line segments, an iterative strategy with multi-resolution is employed to preserve global structures that are masked at full resolution by image details or noise. Then, Geometrical Outlier Removal (GOR) is developed to provide reliable feature point matching, which is based on affine-invariant geometrical classifications for corresponding matches initialized by SIFT. The candidate outliers are selected by comparing the disparity of accumulated classifications among all matches, unlike conventional methods, which rely only on local geometrical relations. Various image sets have been considered in this paper for the evaluation of the proposed approach, including aerial images with simulated affine deformations, remote sensing optical and synthetic aperture radar images taken under different conditions (multispectral, multisensor, and multitemporal), and map images with inconsistent annotations. Experimental results demonstrate the superior performance of the proposed method over the existing approaches for the whole data set.

CV · Sep 8, 2022
RGB-X Classification for Electronics Sorting

FNU Abhimanyu, Tejas Zodage, Umesh Thillaivasan et al.

Effectively disassembling and recovering materials from waste electrical and electronic equipment (WEEE) is a critical step in moving global supply chains from carbon-intensive, mined materials to recycled and renewable ones. Conventional recycling processes rely on shredding and sorting waste streams, but for WEEE, which is composed of numerous dissimilar materials, we explore targeted disassembly of numerous objects for improved material recovery. Many WEEE objects share key features and therefore can look quite similar, but their material composition and internal component layout can vary, and thus it is critical to have an accurate classifier for subsequent disassembly steps for accurate material separation and recovery. This work introduces RGB-X, a multi-modal image classification approach that combines key features from external RGB images with those generated from X-ray images to accurately classify electronic objects. More specifically, this work develops Iterative Class Activation Mapping (iCAM), a novel network architecture that explicitly focuses on the finer details in the multi-modal feature maps that are needed for accurate electronic object classification. Training such a classifier is difficult because large, well-annotated X-ray datasets of electronic objects are lacking, owing to the expense and the need for expert guidance. To overcome this issue, we present a novel way of creating a synthetic dataset using domain randomization applied to the X-ray domain. The combined RGB-X approach gives us an accuracy of 98.6% on 10 generations of modern smartphones, which is greater than the individual accuracies of 89.1% (RGB) and 97.9% (X-ray). We provide experimental results to corroborate these findings.

IV · Apr 2, 2022
RFVTM: A Recovery and Filtering Vertex Trichotomy Matching for Remote Sensing Image Registration

Ming Zhao, Bowen An, Yongpeng Wu et al.

Reliable feature point matching is a vital yet challenging process in feature-based image registration. In this paper, a robust feature point matching algorithm called Recovery and Filtering Vertex Trichotomy Matching (RFVTM) is proposed to remove outliers and retain sufficient inliers for remote sensing images. A novel affine-invariant descriptor called the vertex trichotomy descriptor is proposed, based on the fact that geometrical relations between vertices and lines are preserved after affine transformations; it is constructed by mapping each vertex into trichotomy sets. The outlier removals in Vertex Trichotomy Matching (VTM) are implemented by iteratively comparing the disparity of corresponding vertex trichotomy descriptors. Some inliers mistakenly validated by a large number of outliers are removed in VTM iterations, and several residual outliers close to correct locations cannot be excluded with the same graph structures. Therefore, a recovery and filtering strategy is designed to recover some inliers based on identical vertex trichotomy descriptors and restricted transformation errors. Assisted by the additional recovered inliers, residual outliers can also be filtered out during the process of reaching an identical graph for the expanded vertex sets. Experimental results demonstrate the superior precision and stability of this algorithm under various conditions, such as remote sensing images with large transformations, duplicated patterns, or inconsistent spectral content.

AI · Sep 18, 2023
Bias of AI-Generated Content: An Examination of News Produced by Large Language Models

Xiao Fang, Shangkun Che, Minjia Mao et al.

Large language models (LLMs) have the potential to transform our lives and work through the content they generate, known as AI-Generated Content (AIGC). To harness this transformation, we need to understand the limitations of LLMs. Here, we investigate the bias of AIGC produced by seven representative LLMs, including ChatGPT and LLaMA. We collect news articles from The New York Times and Reuters, both known for their dedication to provide unbiased news. We then apply each examined LLM to generate news content with headlines of these news articles as prompts, and evaluate the gender and racial biases of the AIGC produced by the LLM by comparing the AIGC and the original news articles. We further analyze the gender bias of each LLM under biased prompts by adding gender-biased messages to prompts constructed from these news headlines. Our study reveals that the AIGC produced by each examined LLM demonstrates substantial gender and racial biases. Moreover, the AIGC generated by each LLM exhibits notable discrimination against females and individuals of the Black race. Among the LLMs, the AIGC generated by ChatGPT demonstrates the lowest level of bias, and ChatGPT is the sole model capable of declining content generation when provided with biased prompts.

CV · Jun 5, 2023
Confidence-based federated distillation for vision-based lane-centering

Yitao Chen, Dawei Chen, Haoxin Wang et al.

A fundamental challenge of autonomous driving is maintaining the vehicle in the center of the lane by adjusting the steering angle. Recent advances leverage deep neural networks to predict steering decisions directly from images captured by the car cameras. Machine learning-based steering angle prediction needs to consider the vehicle's limitation in uploading large amounts of potentially private data for model training. Federated learning can address these constraints by enabling multiple vehicles to collaboratively train a global model without sharing their private data, but it is difficult to achieve good accuracy as the data distribution is often non-i.i.d. across the vehicles. This paper presents a new confidence-based federated distillation method to improve the performance of federated learning for steering angle prediction. Specifically, it proposes the novel use of entropy to determine the predictive confidence of each local model, and then selects the most confident local model as the teacher to guide the learning of the global model. A comprehensive evaluation of vision-based lane centering shows that the proposed approach can outperform FedAvg and FedDF by 11.3% and 9%, respectively.
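
The entropy-based teacher selection described above can be sketched in a few lines. This is an illustrative sketch only: it assumes each local model outputs a probability distribution over discretized steering-angle bins, and the names `entropy`, `most_confident`, and the vehicle identifiers are hypothetical rather than taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; lower means more confident."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_confident(local_predictions):
    """Return the name of the local model whose prediction has the lowest entropy."""
    return min(local_predictions, key=lambda name: entropy(local_predictions[name]))

# Assumed softmax outputs of three vehicles' local models over three
# discretized steering-angle bins (left, straight, right) for one input image.
local_predictions = {
    "vehicle_a": [0.40, 0.35, 0.25],
    "vehicle_b": [0.05, 0.90, 0.05],
    "vehicle_c": [0.20, 0.60, 0.20],
}

teacher = most_confident(local_predictions)
print(teacher)  # vehicle_b
```

The selected model's prediction would then serve as the distillation target for the global model on that sample; how the paper aggregates this over a dataset is not specified in the abstract.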

43.1 · NI · Mar 23
MSADM: Large Language Model (LLM) Assisted End-to-End Network Health Management Based on Multi-Scale Semanticization

Fengxiao Tang, Xiaonan Wang, Xun Yuan et al.

Network device and system health management is the foundation of modern network operations and maintenance. Traditional health management methods, relying on expert identification or simple rule-based algorithms, struggle to cope with the heterogeneous networks (HNs) environment. Moreover, current state-of-the-art distributed fault diagnosis methods, which utilize specific machine learning techniques, lack multi-scale adaptivity for heterogeneous device information, resulting in unsatisfactory diagnostic accuracy for HNs. In this paper, we develop an LLM-assisted end-to-end intelligent network health management framework. The framework first proposes a multi-scale data scaling method based on unsupervised learning to address the multi-scale data problem in HNs. Secondly, we combine the semantic rule tree with the attention mechanism to propose a Multi-Scale Semanticized Anomaly Detection Model (MSADM) that generates network semantic information while detecting anomalies. Finally, we embed a chain-of-thought-based large-scale language model downstream to adaptively analyze the fault diagnosis results and create an analysis report containing detailed fault information and optimization strategies. We compare our scheme with other fault diagnosis models and demonstrate that it performs well on several metrics of network fault diagnosis.

4.8 · NI · May 21
Toward Realistic Wi-Fi Fault Diagnosis: A Multi-Modal Benchmark

Junjian Zhang, Haobo Deng, Xinxin Li et al.

Intelligent network operation and maintenance systems in modern networks continuously generate large volumes of multi-modal operational data. However, Wi-Fi fault diagnosis under heterogeneous operational environments remains insufficiently understood. We build a real-world Wi-Fi testbed deployed in campus working environments with an automated fault injection system, and collect a multi-modal Wi-Fi fault dataset containing over 10,000 fault samples across diverse wireless scenarios. To the best of our knowledge, this is among the first publicly available datasets jointly capturing heterogeneous cross-layer operational observations for Wi-Fi fault diagnosis. Based on this dataset, we establish a unified benchmark spanning multiple diagnosis tasks, operational modalities, and representative diagnosis paradigms. Experimental results indicate that effectively leveraging heterogeneous operational data remains challenging for existing diagnosis approaches. We further evaluate emerging LLM-based approaches and develop a reasoning-oriented evaluation framework to assess the consistency between generated diagnostic analyses and actual network conditions. Our findings suggest several important considerations for future multi-modal Wi-Fi diagnosis.

CL · Feb 6
Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning

Xinxin Lin, Guangxin Dai, Yi Zhong et al.

Large language models (LLMs) hold transformative potential for medical decision support, yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs, which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the ClinMPO-aligned light LLM performance against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.

CV · Mar 3, 2022
LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network

Zhigang Jiang, Zhongzheng Xiang, Jinhua Xu et al.

3D room layout estimation from a single panorama using deep neural networks has made great progress. However, previous approaches cannot obtain efficient geometry awareness of room layout with only the latitude of boundaries or horizon-depth. We show that using horizon-depth along with room height can obtain omnidirectional-geometry awareness of room layout in both horizontal and vertical directions. In addition, we propose a planar-geometry aware loss function with normals and gradients of normals to supervise the planeness of walls and turning of corners. We propose an efficient network, LGT-Net, for room layout estimation, which contains a novel Transformer architecture called SWG-Transformer to model geometry relations. SWG-Transformer consists of (Shifted) Window Blocks and Global Blocks to combine the local and global geometry relations. Moreover, we design a novel relative position embedding of Transformer to enhance the spatial identification ability for the panorama. Experiments show that the proposed LGT-Net achieves better performance than the current state of the art (SOTA) on benchmark datasets.

CV · Dec 1, 2024 · Code
DMFourLLIE: Dual-Stage and Multi-Branch Fourier Network for Low-Light Image Enhancement

Tongshun Zhang, Pingping Liu, Ming Zhao et al.

In the Fourier frequency domain, luminance information is primarily encoded in the amplitude component, while spatial structure information is significantly contained within the phase component. Existing low-light image enhancement techniques using Fourier transform have mainly focused on amplifying the amplitude component and simply replicating the phase component, an approach that often leads to color distortions and noise issues. In this paper, we propose a Dual-Stage Multi-Branch Fourier Low-Light Image Enhancement (DMFourLLIE) framework to address these limitations by emphasizing the phase component's role in preserving image structure and detail. The first stage integrates structural information from infrared images to enhance the phase component and employs a luminance-attention mechanism in the luminance-chrominance color space to precisely control amplitude enhancement. The second stage combines multi-scale and Fourier convolutional branches for robust image reconstruction, effectively recovering spatial structures and textures. This dual-branch joint optimization process ensures that complex image information is retained, overcoming the limitations of previous methods that neglected the interplay between amplitude and phase. Extensive experiments across multiple datasets demonstrate that DMFourLLIE outperforms current state-of-the-art methods in low-light image enhancement. Our code is available at https://github.com/bywlzts/DMFourLLIE.
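
The amplitude/phase split that the abstract builds on can be demonstrated on a one-dimensional row of pixel intensities. The sketch below uses a hand-written discrete Fourier transform so that it needs only the standard library; it shows the simple baseline the abstract criticizes (scale the amplitude, copy the phase), not the DMFourLLIE network itself. The function names and the pixel values are assumptions for illustration.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def split(spectrum):
    """Separate a complex spectrum into amplitude and phase components."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def merge(amplitude, phase):
    """Recombine amplitude and phase into a complex spectrum."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]

dark_row = [0.10, 0.20, 0.05, 0.15]   # assumed low-light pixel intensities
amplitude, phase = split(dft(dark_row))

gain = 3.0                             # uniform amplitude gain, phase left untouched
brightened = idft(merge([gain * a for a in amplitude], phase))
print([round(v, 6) for v in brightened])
```

With a uniform gain the output is simply the input scaled by that gain, which confirms that overall brightness lives in the amplitude while the phase fixes where the structure sits. A learned, non-uniform amplitude change paired with an unchanged phase no longer has this clean behaviour, which is the gap the abstract says the method targets.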

CV · Nov 1, 2025 · Code
ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training

Xin Yao, Haiyang Zhao, Yimin Chen et al.

The Contrastive Language-Image Pretraining (CLIP) model has significantly advanced vision-language modeling by aligning image-text pairs from large-scale web data through self-supervised contrastive learning. Yet, its reliance on uncurated Internet-sourced data exposes it to data poisoning and backdoor risks. While existing studies primarily investigate image-based attacks, the text modality, which is equally central to CLIP's training, remains underexplored. In this work, we introduce ToxicTextCLIP, a framework for generating high-quality adversarial texts that target CLIP during the pre-training phase. The framework addresses two key challenges: semantic misalignment caused by background inconsistency with the target class, and the scarcity of background-consistent texts. To this end, ToxicTextCLIP iteratively applies: 1) a background-aware selector that prioritizes texts with background content aligned to the target class, and 2) a background-driven augmenter that generates semantically coherent and diverse poisoned samples. Extensive experiments on classification and retrieval tasks show that ToxicTextCLIP achieves up to 95.83% poisoning success and 98.68% backdoor Hit@1, while bypassing RoCLIP, CleanCLIP and SafeCLIP defenses. The source code can be accessed via https://github.com/xinyaocse/ToxicTextCLIP/.

LG · Mar 14, 2023
A Contrastive Knowledge Transfer Framework for Model Compression and Transfer Learning

Kaiqi Zhao, Yitao Chen, Ming Zhao

Knowledge Transfer (KT) achieves competitive performance and is widely used for image classification tasks in model compression and transfer learning. Existing KT works transfer the information from a large model ("teacher") to train a small model ("student") by minimizing the difference of their conditionally independent output distributions. However, these works overlook the high-dimension structural knowledge from the intermediate representations of the teacher, which leads to limited effectiveness, and they are motivated by various heuristic intuitions, which makes it difficult to generalize. This paper proposes a novel Contrastive Knowledge Transfer Framework (CKTF), which enables the transfer of sufficient structural knowledge from the teacher to the student by optimizing multiple contrastive objectives across the intermediate representations between them. Also, CKTF provides a generalized agreement to existing KT techniques and increases their performance significantly by deriving them as specific cases of CKTF. The extensive evaluation shows that CKTF consistently outperforms the existing KT works by 0.04% to 11.59% in model compression and by 0.4% to 4.75% in transfer learning on various models and datasets.

LG · Aug 9, 2024
Federated Hypergraph Learning with Local Differential Privacy: Toward Privacy-Aware Hypergraph Structure Completion

Linfeng Luo, Zhiqi Guo, Fengxiao Tang et al.

The rapid growth of graph-structured data necessitates partitioning and distributed storage across decentralized systems, driving the emergence of federated graph learning to collaboratively train Graph Neural Networks (GNNs) without compromising privacy. However, current methods exhibit limited performance when handling hypergraphs, which inherently represent complex high-order relationships beyond pairwise connections. Partitioning hypergraph structures across federated subsystems amplifies structural complexity, hindering high-order information mining and compromising local information integrity. To bridge the gap between hypergraph learning and federated systems, we develop FedHGL, a first-of-its-kind framework for federated hypergraph learning on disjoint and privacy-constrained hypergraph partitions. Beyond collaboratively training a comprehensive hypergraph neural network across multiple clients, FedHGL introduces a pre-propagation hyperedge completion mechanism to preserve high-order structural integrity within each client. This procedure leverages the federated central server to perform cross-client hypergraph convolution without exposing internal topological information, effectively mitigating the high-order information loss induced by subgraph partitioning. Furthermore, by incorporating two kinds of local differential privacy (LDP) mechanisms, we provide formal privacy guarantees for this process, ensuring that sensitive node features remain protected against inference attacks from potentially malicious servers or clients. Experimental results on seven real-world datasets confirm the effectiveness of our approach and demonstrate its performance advantages over traditional federated graph learning methods.

CV · Sep 22, 2023
Poster: Self-Supervised Quantization-Aware Knowledge Distillation

Kaiqi Zhao, Ming Zhao

Quantization-aware training (QAT) starts with a pre-trained full-precision model and performs quantization during retraining. However, existing QAT works require supervision from the labels and they suffer from accuracy loss due to reduced precision. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation framework (SQAKD). SQAKD first unifies the forward and backward dynamics of various quantization functions and then reframes QAT as a co-optimization problem that simultaneously minimizes the KL-Loss and the discretization error, in a self-supervised manner. The evaluation shows that SQAKD significantly improves the performance of various state-of-the-art QAT works. SQAKD establishes stronger baselines and does not require extensive labeled training data, potentially making state-of-the-art QAT research more accessible.

LG · Mar 17, 2024 · Code
Self-Supervised Quantization-Aware Knowledge Distillation

Kaiqi Zhao, Ming Zhao

Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. However, existing works applying KD to QAT require tedious hyper-parameter tuning to balance the weights of different loss terms, assume the availability of labeled training data, and require complex, computationally intensive training procedures for good performance. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework. SQAKD first unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the KL-Loss between the full-precision and low-bit models for KD and the discretization error for quantization, without supervision from labels. A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/SQAKD.git.
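
The label-free KL term mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions: the full-precision model acts as teacher, the low-bit model as student, and both produce logits for the same input. The logit values and the helper names `softmax` and `kl_divergence` are hypothetical, and the discretization-error term of the co-optimization is not modeled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Assumed logits for one input: full-precision teacher vs. low-bit student.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.6]

# No ground-truth label appears anywhere in this loss.
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
print(loss > 0.0)  # True
```

Because the target distribution comes from the teacher rather than from annotations, the loss can be computed on unlabeled data, which matches the self-supervised framing in the abstract.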

CL · Jan 20, 2025 · Code
Chat3GPP: An Open-Source Retrieval-Augmented Generation Framework for 3GPP Documents

Long Huang, Ming Zhao, Limin Xiao et al.

The 3rd Generation Partnership Project (3GPP) documents are key standards in global telecommunications, yet they pose significant challenges for engineers and researchers in the telecommunications field due to the large volume and complexity of their contents as well as the frequent updates. Large language models (LLMs) have shown promise in natural language processing tasks, but their general-purpose nature limits their effectiveness in specific domains like telecommunications. To address this, we propose Chat3GPP, an open-source retrieval-augmented generation (RAG) framework tailored for 3GPP specifications. By combining chunking strategies, hybrid retrieval and efficient indexing methods, Chat3GPP can efficiently retrieve relevant information and generate accurate responses to user queries without requiring domain-specific fine-tuning. The framework is both flexible and scalable, offering significant potential for adapting to other technical standards beyond 3GPP. We evaluate Chat3GPP on two telecom-specific datasets and demonstrate its superior performance compared to existing methods, showcasing its potential for downstream tasks like protocol generation and code automation.

LG · Sep 27, 2025 · Code
SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size

Junhao Xia, Ming Zhao, Limin Xiao et al.

Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR), enabling dynamic adaptation to memory or VRAM constraints by selecting a fractional OSR (e.g., 2.5 times) for an optimal trade-off between model size and accuracy. SDQ-LLM uses upsampling combined with a Sigma-Delta Quantizer to binarize or ternarize LLM weights, encoding high-precision parameters into 1-bit or 1.58-bit representations and replacing the multiplication operations within linear layers with addition. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce the loss of quantization precision, we incorporate Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully leverage the continuity of the OSR and reduce precision loss, recognizing the correlation between quantization sensitivity and weight variance, we propose a fine-grained, layer- and linear-wise OSR allocation strategy, MultiOSR. This strategy distributes OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on OPT and LLaMA model families demonstrate that SDQ-LLM achieves more efficient and higher-precision performance even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.
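
To give a feel for how oversampling lets 1-bit values carry a high-precision number, here is a textbook first-order sigma-delta modulator applied to a single scalar. It is a sketch under assumptions, not the SDQ-LLM pipeline: the weight is assumed to lie in [-1, 1], the OSR is an integer, and the Hadamard smoothing, fractional OSR, and MultiOSR allocation from the abstract are not modeled. The names `sigma_delta_1bit` and `decode` are hypothetical.

```python
def sigma_delta_1bit(weight, osr):
    """Encode a weight in [-1, 1] as `osr` values in {-1, +1}.

    The integrator accumulates the difference between the input and the
    previous 1-bit output, so quantization error is fed back and averaged out.
    """
    bits = []
    integrator = 0.0
    feedback = 0.0                      # no output has been emitted yet
    for _ in range(osr):
        integrator += weight - feedback
        feedback = 1.0 if integrator >= 0.0 else -1.0
        bits.append(int(feedback))
    return bits

def decode(bits):
    """Recover an approximation of the weight as the mean of the 1-bit stream."""
    return sum(bits) / len(bits)

weight = 0.3
for osr in (8, 64, 512):
    approx = decode(sigma_delta_1bit(weight, osr))
    print(osr, approx, abs(approx - weight))
```

For this first-order loop the reconstruction error is bounded by 1/OSR, so a larger OSR buys accuracy at the cost of storing more 1-bit values, which is the size versus accuracy trade-off the abstract refers to.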

CLSep 4, 2025Code
Chain or tree? Re-evaluating complex reasoning from the perspective of a matrix of thought

Fengxiao Tang, Yufeng Li, Zongzong Wu et al.

Large Language Models (LLMs) face significant accuracy degradation due to insufficient reasoning ability when dealing with complex and abstract tasks. Thought structures such as Chain of Thought (CoT) and Tree of Thought (ToT) focus on enhancing the reasoning capability of LLMs. However, they suffer from inherent drawbacks such as redundancy within the same layer of the tree structure and the singularity of the paths in the chain structure. Some studies have utilized Retrieval-Augmented Generation (RAG) methods to enhance CoT and ToT in mitigating hallucinations in LLMs, yet the fundamental shortcomings of the thought structures still persist. Furthermore, when dealing with multi-entity and multi-hop information, the retrieved verification knowledge often contains large amounts of fragmented, superficial, or even erroneous data, misleading the reasoning process of LLMs. To address these issues, we propose the Matrix of Thought (MoT), a novel and efficient thought structure for LLMs. MoT explores problems in both horizontal and vertical dimensions through a "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep thinking while reducing redundancy in the thought nodes within the column cells, thereby enhancing the reasoning capability of LLMs. Additionally, through a fact-correction mechanism, it leverages the knowledge graph triples retrieved by RAG and the original text to construct knowledge units and correct erroneous answers. To validate the effectiveness of this method, we conducted extensive experiments in three tasks: 24-point game, question answering evaluation, and proposition writing. The results demonstrate that our framework outperforms state-of-the-art methods, with reasoning time only 14.4\% of that of the baseline method, proving its efficiency and accuracy. The code for the framework is available at https://github.com/lyfiter/mtqa.

CVNov 21, 2024
Multitask Learning for SAR Ship Detection with Gaussian-Mask Joint Segmentation

Ming Zhao, Xin Zhang, André Kaup

Detecting ships in synthetic aperture radar (SAR) images is challenging due to strong speckle noise, complex surroundings, and varying scales. This paper proposes MLDet, a multitask learning framework for SAR ship detection, consisting of object detection, speckle suppression, and target segmentation tasks. An angle classification loss with aspect ratio weighting is introduced to improve detection accuracy by addressing angular periodicity and object proportions. The speckle suppression task uses a dual-feature fusion attention mechanism to reduce noise and fuse shallow and denoising features, enhancing robustness. The target segmentation task, leveraging a rotated Gaussian-mask, aids the network in extracting target regions from cluttered backgrounds and improves detection efficiency with pixel-level predictions. The Gaussian-mask ensures ship centers have the highest probabilities, gradually decreasing outward under a Gaussian distribution. Additionally, a weighted rotated boxes fusion (WRBF) strategy combines multi-direction anchor predictions, filtering anchors beyond boundaries or with high overlap but low confidence. Extensive experiments on SSDD+ and HRSID datasets demonstrate the effectiveness and superiority of MLDet.
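The rotated Gaussian-mask described above (highest probability at the ship center, decaying outward along the box axes) can be sketched as follows. This is an illustrative construction under stated assumptions rather than the MLDet code: the function name is hypothetical, and tying each standard deviation to a quarter of the box side is an arbitrary choice made for the example.

```python
import math


def rotated_gaussian_mask(height, width, cx, cy, box_len, box_wid, angle_rad):
    """Build a 2D mask whose value is 1.0 at (cx, cy) and decays as a rotated Gaussian."""
    sigma_u = box_len / 4.0   # spread along the box's long axis (assumed scaling)
    sigma_v = box_wid / 4.0   # spread along the box's short axis (assumed scaling)
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Rotate pixel offsets into the box-aligned coordinate frame.
            u = dx * cos_a + dy * sin_a
            v = -dx * sin_a + dy * cos_a
            row.append(math.exp(-0.5 * ((u / sigma_u) ** 2 + (v / sigma_v) ** 2)))
        mask.append(row)
    return mask


# A 21x21 mask for a 12x4 box centered at (10, 10), rotated by 30 degrees.
mask = rotated_gaussian_mask(21, 21, 10, 10, 12, 4, math.pi / 6)
```

Used as a soft segmentation target, such a mask gives the network a pixel-level signal that peaks at the object center instead of a hard box boundary.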

IRSep 18, 2024
EnhancedRL: An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Recommender Systems

Peng Liu, Cong Xu, Jiawei Zhu et al.

As a key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) is responsible for merging multiple scores output by Multi-Task Learning (MTL) into a single score, finally determining the recommendation results. Recently, Reinforcement Learning (RL) has been applied to MTF to maximize long-term user satisfaction within a recommendation session. However, due to limitations in the modeling paradigm, all existing RL algorithms for MTF can only utilize user features and statistical features as the state to generate actions at the user level, and are unable to leverage item features and other valuable features, which leads to suboptimal performance. Overcoming this problem requires a breakthrough in the existing modeling paradigm, yet, to date, no prior work has addressed it. To tackle this challenge, we propose EnhancedRL, an innovative RL algorithm. Unlike existing RL-MTF methods, EnhancedRL takes the enhanced state as input, incorporating not only user features but also item features and other valuable information. Furthermore, it introduces a tailored actor-critic framework (including a redesigned actor and critics and a novel learning procedure) to optimize long-term rewards at the user-item pair level within a recommendation session. Extensive offline and online experiments are conducted in an industrial RS and the results demonstrate that EnhancedRL outperforms other methods remarkably, achieving a +3.84% increase in user valid consumption and a +0.58% increase in user duration time. To the best of our knowledge, EnhancedRL is the first work to address this challenge, and it has been fully deployed in a large-scale RS since September 14, 2023, yielding significant improvements.

CVNov 29, 2023
LiDAR-based Outdoor Crowd Management for Smart Campus on the Edge

Yitao Chen, Krishna Gundu, Zohair Zaidi et al.

Crowd management is crucial for a smart campus. Popular methods are camera-based. However, conventional camera-based approaches may leak users' personally identifiable features, jeopardizing users' privacy, which limits their application. In this work, we investigate using affordable light detection and ranging (LiDAR) technology to perform outdoor crowd management leveraging edge computing. Specifically, we aim to count the number of people on a walkway of a university campus. Besides privacy protection, LiDAR sensors are superior to cameras since their performance is not compromised when the campus is poorly illuminated. We deploy LiDAR sensors on light poles to collect data from the crowd on the campus and leverage edge accelerators to process data locally. We propose two different methodologies in this work: 1) a non-convolutional neural network (CNN)-based approach, using clustering and an autoencoder, and 2) a CNN-based approach that first projects point clouds to 2D planes and then processes the projection with conventional CNNs. Our first approach relies on careful feature engineering, whereas our second approach does not require such effort. However, the CNN-based approach requires more computational power than our non-CNN-based approach. We evaluate both approaches comprehensively with our hand-labeled real-life data collected from campus. Our evaluation results show that the first method achieves an accuracy of 85.4%, whereas the second method achieves 95.8%. Our CNN-based method outperforms existing solutions significantly. We also deploy our two models on an edge accelerator, TPU, to measure the speedup, leveraging this specialized accelerator.
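The projection step in the second approach (point cloud to a 2D plane, then a conventional CNN) might look like the sketch below. The abstract does not say which plane or resolution is used, so the top-down (x, y) projection, the cell size, and the function name here are all assumptions made for illustration.

```python
def project_to_grid(points, x_min, x_max, y_min, y_max, cell):
    """Count LiDAR points per cell of a top-down (x, y) grid; z is dropped."""
    cols = int((x_max - x_min) / cell)
    rows = int((y_max - y_min) / cell)
    grid = [[0] * cols for _ in range(rows)]
    for x, y, _z in points:
        # Ignore returns that fall outside the region of interest.
        if x_min <= x < x_max and y_min <= y < y_max:
            col = min(int((x - x_min) / cell), cols - 1)
            row = min(int((y - y_min) / cell), rows - 1)
            grid[row][col] += 1
    return grid


# Four hypothetical (x, y, z) returns in metres; the last one is out of range.
points = [(0.5, 0.5, 1.0), (0.7, 0.2, 1.5), (3.9, 2.1, 0.3), (5.0, 1.0, 0.0)]
grid = project_to_grid(points, 0.0, 4.0, 0.0, 3.0, 1.0)
```

The resulting grid is an ordinary single-channel image, so it can be passed to a standard 2D CNN without any point-cloud-specific layers.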

CVNov 22, 2024
Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

Zengbao Sun, Ming Zhao, Gaorui Liu et al.

Remote sensing cross-modal text-image retrieval (RSCTIR) has gained attention for its utility in information mining. However, challenges remain in effectively integrating global and local information due to variations in remote sensing imagery and ensuring proper feature pre-alignment before modal fusion, which affects retrieval accuracy and efficiency. To address these issues, we propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention and global-local window cross-attention to capture multi-scale features. A pre-alignment mechanism simplifies modal fusion training, improving retrieval performance. Additionally, we introduce a similarity matrix reweighting (SMR) algorithm for reranking, and enhance the triplet loss function with an intra-class distance term to optimize feature learning. Experiments on four datasets, including RSICD and RSITMD, validate CMPAGL's effectiveness, achieving up to 4.65% improvement in R@1 and 2.28% in mean Recall (mR) over state-of-the-art methods.

20.1NIApr 25
RadTwin: Generalizable Wireless Digital Twin for Dynamic Environments

Yuru Zhang, Ming Zhao, Qiang Liu et al.

Precisely modeling radio propagation in dynamic wireless environments is fundamental to the realization of wireless digital twins. Traditional ray tracing methods rely on accurate 3D models with detailed environment parameters, while recent neural radiance field approaches learn representations tied to specific static scenes, requiring retraining when environments change. In this paper, we propose RadTwin, a generalizable wireless digital twin framework that explicitly conditions on scene geometry, enabling adaptation to dynamic environments without retraining. RadTwin comprises three key components: 1) a scenario representation network that extracts high-level latent scene features from point clouds, 2) an electromagnetic ray tracing module that computes physics-informed sparse attention masks identifying voxels that physically contribute signals toward each query direction, and 3) a neural propagation decoder that aggregates relevant scene features through masked cross-attention to learn how radio propagation behaves within the given scene geometry. We evaluate RadTwin on a customized dataset of indoor scenes with varying furniture arrangements. Experimental results show that RadTwin achieves 31.6% higher SSIM (0.846 vs. 0.643) and 91.96% lower LPIPS (0.023 vs. 0.286) compared to NeRF2. RadTwin further demonstrates superior cross-scale performance and high generalization and data efficiency, representing a significant advancement toward practical digital network twins for dynamic wireless environments.
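The relative improvements quoted above follow directly from the raw SSIM and LPIPS values in the abstract. The short check below recomputes them; the helper name is hypothetical.

```python
def relative_change_percent(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Raw metric values reported in the abstract (RadTwin vs. NeRF2).
ssim_gain = relative_change_percent(0.846, 0.643)         # higher SSIM is better
lpips_reduction = -relative_change_percent(0.023, 0.286)  # lower LPIPS is better

print(round(ssim_gain, 1), round(lpips_reduction, 2))
```

Both figures are relative to the NeRF2 baseline, not absolute differences in the metric.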

IRApr 19, 2024
UnifiedRL: A Reinforcement Learning Algorithm Tailored for Multi-Task Fusion in Large-Scale Recommender Systems

Peng Liu, Cong Xu, Ming Zhao et al.

As the last pivotal stage of a Recommender System (RS), Multi-Task Fusion (MTF) is responsible for combining multiple scores output by a Multi-Task Learning (MTL) model into a final score to maximize user satisfaction. Recently, to optimize long-term user satisfaction, Reinforcement Learning (RL) has been used for MTF in RSs. However, the existing offline RL algorithms used for MTF have the following severe problems: a) To avoid Out-of-Distribution (OOD) issues, their constraints are overly strict, which seriously damages performance; b) They are unaware of the exploration policy used to collect training data, so only a suboptimal policy can be learned; c) Their exploration policies are inefficient and hurt user experience. To solve the above problems, we propose an innovative method called UnifiedRL tailored for MTF in large-scale RSs. UnifiedRL seamlessly integrates an offline RL model with its custom exploration policy to relax overly strict constraints, which is different from existing RL-MTF methods and significantly improves performance. In addition, compared to existing exploration policies, UnifiedRL's custom exploration policy is highly efficient, enabling frequent online exploration and offline training iterations, which further improves performance. Extensive offline and online experiments are conducted in a large-scale RS. The results demonstrate that UnifiedRL outperforms other existing MTF methods remarkably, achieving a +4.64% increase in user valid consumption and a +1.74% increase in user duration time. To the best of our knowledge, UnifiedRL is the first RL algorithm tailored for MTF in RSs and has been successfully deployed in multiple large-scale RSs since June 2023, yielding significant benefits.

CVJun 27, 2025
ReF-LLE: Personalized Low-Light Enhancement via Reference-Guided Deep Reinforcement Learning

Ming Zhao, Pingping Liu, Tongshun Zhang et al.

Low-light image enhancement presents two primary challenges: 1) Significant variations in low-light images across different conditions, and 2) Enhancement levels influenced by subjective preferences and user intent. To address these issues, we propose ReF-LLE, a novel personalized low-light image enhancement method that operates in the Fourier frequency domain and incorporates deep reinforcement learning. ReF-LLE is the first to integrate deep reinforcement learning into this domain. During training, a zero-reference image evaluation strategy is introduced to score enhanced images, providing reward signals that guide the model to handle varying degrees of low-light conditions effectively. In the inference phase, ReF-LLE employs a personalized adaptive iterative strategy, guided by the zero-frequency component in the Fourier domain, which represents the overall illumination level. This strategy enables the model to adaptively adjust low-light images to align with the illumination distribution of a user-provided reference image, ensuring personalized enhancement results. Extensive experiments on benchmark datasets demonstrate that ReF-LLE outperforms state-of-the-art methods, achieving superior perceptual quality and adaptability in personalized low-light image enhancement.
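The claim that the zero-frequency Fourier component represents overall illumination follows from the definition of the discrete Fourier transform: at frequency (0, 0) every complex exponential equals 1, so the coefficient is the sum of all pixel values. A minimal sketch, using a hypothetical helper that evaluates a single 2D DFT coefficient directly:

```python
import cmath


def dft2_coefficient(image, u, v):
    """Evaluate one coefficient F(u, v) of the 2D DFT of a grayscale image."""
    rows, cols = len(image), len(image[0])
    total = 0j
    for y in range(rows):
        for x in range(cols):
            phase = -2j * cmath.pi * (u * y / rows + v * x / cols)
            total += image[y][x] * cmath.exp(phase)
    return total


image = [[0.1, 0.2], [0.3, 0.4]]            # tiny illustrative image
dc = dft2_coefficient(image, 0, 0)           # zero-frequency (DC) term
mean_brightness = dc.real / (len(image) * len(image[0]))
```

Dividing the zero-frequency term by the pixel count gives the mean intensity, which is why comparing this one coefficient between an input and a reference image is a compact way to compare their overall brightness.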

AO-PHApr 30, 2025
Advancing Seasonal Prediction of Tropical Cyclone Activity with a Hybrid AI-Physics Climate Model

Gan Zhang, Megha Rao, Janni Yuval et al.

Machine learning (ML) models are successful with weather forecasting and have shown progress in climate simulations, yet leveraging them for useful climate predictions needs exploration. Here we show this feasibility using Neural General Circulation Model (NeuralGCM), a hybrid ML-physics atmospheric model developed by Google, for seasonal predictions of large-scale atmospheric variability and Northern Hemisphere tropical cyclone (TC) activity. Inspired by physical model studies, we simplify boundary conditions, assuming sea surface temperature (SST) and sea ice follow their climatological cycle but persist anomalies present at the initialization time. With such forcings, NeuralGCM can generate 100 simulation days in ~8 minutes with a single Graphics Processing Unit (GPU), while simulating realistic atmospheric circulation and TC climatology patterns. This configuration yields useful seasonal predictions (July to November) for the tropical atmosphere and various TC activity metrics. Notably, the predicted and observed TC frequency in the North Atlantic and East Pacific basins are significantly correlated during 1990 to 2023 (r=~0.7), suggesting prediction skill comparable to existing physical GCMs. Despite challenges associated with model resolution and simplified boundary forcings, the model-predicted interannual variations demonstrate significant correlations with the observation, including the sub-basin TC tracks (p<0.1) and basin-wide accumulated cyclone energy (p<0.01) of the North Atlantic and North Pacific basins. These findings highlight the promise of leveraging ML models with physical insights to model TC risks and deliver seamless weather-climate predictions.

IRMay 21, 2024
Dynamic User Interest Augmentation via Stream Clustering and Memory Networks in Large-Scale Recommender Systems

Peng Liu, Nian Wang, Cong Xu et al.

A Recommender System (RS) provides personalized recommendation services based on user interest. However, many users' interests are sparse because they have few consumption behaviors, making it challenging to provide accurate recommendations for them; this problem is widespread in large-scale RSs. In particular, efficiently solving this problem in the ranking stage of an RS is an even greater challenge, which requires an end-to-end and real-time approach. To solve this problem, we propose an innovative method called Dynamic User Interest Augmentation (DUIA). DUIA enhances user interest, including the user profile and user history behavior sequences, by generating enhancement vectors and personalized enhancement vectors through dynamic stream clustering of similar users and relevant items from multiple perspectives. To realize stream clustering, we specially design an algorithm called Gradient-based Hierarchical Clustering Algorithm (GHCA) for DUIA, which performs clustering via gradient descent and stores the cluster centers in memory networks. Extensive offline and online experiments demonstrate that DUIA not only significantly improves model performance for users with sparse interests but also delivers notable gains for other users. As an end-to-end method, DUIA can be easily integrated with existing models. Furthermore, DUIA is also used for long-tail items and the cold-start problem, where it also yields excellent improvements. Since 2022, DUIA has been successfully deployed in multiple industrial RSs in Tencent and was made public in May 2024. Moreover, the ideas behind DUIA, dynamic stream clustering and similarity-based enhancement, have inspired relevant works and have also been applied in other stages of RS.

LGNov 19, 2025
FaultDiffusion: Few-Shot Fault Time Series Generation with Diffusion Model

Yi Xu, Zhigang Chen, Rui Wang et al.

In industrial equipment monitoring, fault diagnosis is critical for ensuring system reliability and enabling predictive maintenance. However, the scarcity of fault data, due to the rarity of fault events and the high cost of data annotation, significantly hinders data-driven approaches. Existing time-series generation models, optimized for abundant normal data, struggle to capture fault distributions in few-shot scenarios, producing samples that lack authenticity and diversity due to the large domain gap and high intra-class variability of faults. To address this, we propose a novel few-shot fault time-series generation framework based on diffusion models. Our approach employs a positive-negative difference adapter, leveraging pre-trained normal data distributions to model the discrepancies between normal and fault domains for accurate fault synthesis. Additionally, a diversity loss is introduced to prevent mode collapse, encouraging the generation of diverse fault samples through inter-sample difference regularization. Experimental results demonstrate that our model significantly outperforms traditional methods in authenticity and diversity, achieving state-of-the-art performance on key benchmarks.

CVOct 11, 2025
From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology

Yizhi Wang, Li Chen, Qiang Huang et al.

Cervical cancer remains a major malignancy, necessitating extensive and complex histopathological assessments and comprehensive support tools. Although deep learning shows promise, these models still lack accuracy and generalizability. General foundation models offer a broader reach but remain limited in capturing subspecialty-specific features and task adaptability. We introduce the Cervical Subspecialty Pathology (CerS-Path) diagnostic system, developed through two synergistic pretraining stages: self-supervised learning on approximately 190 million tissue patches from 140,000 slides to build a cervical-specific feature extractor, and multimodal enhancement with 2.5 million image-text pairs, followed by integration with multiple downstream diagnostic functions. Supporting eight diagnostic functions, including rare cancer classification and multimodal Q&A, CerS-Path surpasses prior foundation models in scope and clinical applicability. Comprehensive evaluations demonstrate a significant advance in cervical pathology, with prospective testing on 3,173 cases across five centers maintaining 99.38% screening sensitivity and excellent generalizability, highlighting its potential for subspecialty diagnostic translation and cervical cancer screening.

CVOct 3, 2025
SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Ming Zhao, Wenhui Dong, Yang Zhang et al.

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

CVSep 24, 2025
MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning

Zeyu He, Shuai Huang, Yuwu Lu et al.

Few-Shot Class-Incremental Learning (FSCIL) must contend with the dual challenge of learning new classes from scarce samples while preserving old class knowledge. Existing methods use the frozen feature extractor and class-averaged prototypes to mitigate against catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias due to extreme data scarcity, whereas base-class prototypes benefit from sufficient data. In this work, we theoretically demonstrate that aligning the new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. Furthermore, we propose large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information for new-class prototypes, we integrate momentum self-supervision and virtual categories into the Momentum Tightness and Contrast framework (MoTiC), constructing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks produce state-of-the-art performances, particularly on the fine-grained task CUB-200, validating our method's ability to reduce estimation bias and improve incremental learning robustness.
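The class-averaged prototypes that the abstract attributes to existing methods can be sketched in a few lines. This shows only that baseline mechanism (mean feature per class, nearest-prototype prediction), not MoTiC itself; the function names and toy feature vectors are hypothetical.

```python
def class_prototypes(features, labels):
    """Average the feature vectors of each class to get one prototype per class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def nearest_prototype(query, prototypes):
    """Predict the class whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))


# Toy frozen-extractor features: two samples of class "a", one of class "b".
features = [[0, 0], [2, 0], [10, 10]]
labels = ["a", "a", "b"]
prototypes = class_prototypes(features, labels)
prediction = nearest_prototype([9, 9], prototypes)
```

With a single sample, the prototype of class "b" is just that sample, which illustrates why new-class prototypes estimated from very few examples can be strongly biased.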

CVAug 13, 2025
WEC-DG: Multi-Exposure Wavelet Correction Method Guided by Degradation Description

Ming Zhao, Pingping Liu, Tongshun Zhang et al.

Multi-exposure correction technology is essential for restoring images affected by insufficient or excessive lighting, enhancing the visual experience by improving brightness, contrast, and detail richness. However, current multi-exposure correction methods often encounter challenges in addressing intra-class variability caused by diverse lighting conditions, shooting environments, and weather factors, particularly when processing images captured at a single exposure level. To enhance the adaptability of these models under complex imaging conditions, this paper proposes a Wavelet-based Exposure Correction method with Degradation Guidance (WEC-DG). Specifically, we introduce a degradation descriptor within the Exposure Consistency Alignment Module (ECAM) at both ends of the processing pipeline to ensure exposure consistency and achieve final alignment. This mechanism effectively addresses miscorrected exposure anomalies caused by existing methods' failure to recognize 'blurred' exposure degradation. Additionally, we investigate the light-detail decoupling properties of the wavelet transform to design the Exposure Restoration and Detail Reconstruction Module (EDRM), which processes low-frequency information related to exposure enhancement before utilizing high-frequency information as a prior guide for reconstructing spatial domain details. This serial processing strategy guarantees precise light correction and enhances detail recovery. Extensive experiments conducted on multiple public datasets demonstrate that the proposed method outperforms existing algorithms, achieving significant performance improvements and validating its effectiveness and practical applicability.
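The light-detail decoupling property mentioned above can be seen with a single-level 2D Haar transform: the low-frequency band carries local brightness, while the detail bands respond only to differences between neighboring pixels. The sketch below is a generic Haar decomposition, not the EDRM module; the function name and band names are hypothetical, and the image is assumed to have even height and width.

```python
def haar2d_level1(image):
    """Single-level orthonormal 2D Haar transform over non-overlapping 2x2 blocks."""
    low, detail_cols, detail_rows, detail_diag = [], [], [], []
    for y in range(0, len(image), 2):
        rows = ([], [], [], [])
        for x in range(0, len(image[0]), 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            rows[0].append((a + b + c + d) / 2)   # low frequency: local brightness
            rows[1].append((a - b + c - d) / 2)   # change across columns
            rows[2].append((a + b - c - d) / 2)   # change across rows
            rows[3].append((a - b - c + d) / 2)   # diagonal change
        low.append(rows[0])
        detail_cols.append(rows[1])
        detail_rows.append(rows[2])
        detail_diag.append(rows[3])
    return low, detail_cols, detail_rows, detail_diag


image = [[1, 2], [3, 4]]
brighter = [[v + 10 for v in row] for row in image]   # uniform exposure shift
bands = haar2d_level1(image)
bands_brighter = haar2d_level1(brighter)
```

Adding a uniform offset to the image changes only the low-frequency band and leaves the three detail bands untouched, which is the separation that lets exposure be corrected before details are reconstructed.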

CRJul 15, 2025
LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification

Fengxiao Tang, Huan Li, Ming Zhao et al.

Verifying the credibility of Cyber Threat Intelligence (CTI) is essential for reliable cybersecurity defense. However, traditional approaches typically treat this task as a static classification problem, relying on handcrafted features or isolated deep learning models. These methods often lack the robustness needed to handle incomplete, heterogeneous, or noisy intelligence, and they provide limited transparency in decision-making, factors that reduce their effectiveness in real-world threat environments. To address these limitations, we propose LRCTI, a Large Language Model (LLM)-based framework designed for multi-step CTI credibility verification. The framework first employs a text summarization module to distill complex intelligence reports into concise and actionable threat claims. It then uses an adaptive multi-step evidence retrieval mechanism that iteratively identifies and refines supporting information from a CTI-specific corpus, guided by LLM feedback. Finally, a prompt-based Natural Language Inference (NLI) module is applied to evaluate the credibility of each claim while generating interpretable justifications for the classification outcome. Experiments conducted on two benchmark datasets, CTI-200 and PolitiFact, show that LRCTI improves F1-Macro and F1-Micro scores by over 5%, reaching 90.9% and 93.6%, respectively, compared to state-of-the-art baselines. These results demonstrate that LRCTI effectively addresses the core limitations of prior methods, offering a scalable, accurate, and explainable solution for automated CTI credibility verification.

CLSep 4, 2023
UniSA: Unified Generative Framework for Sentiment Analysis

Zaijing Li, Ting-En Lin, Yuchuan Wu et al.

Sentiment analysis is a crucial task that aims to understand people's emotional states and predict emotional categories based on multimodal information. It consists of several subtasks, such as emotion recognition in conversation (ERC), aspect-based sentiment analysis (ABSA), and multimodal sentiment analysis (MSA). However, unifying all subtasks in sentiment analysis presents numerous challenges, including modality alignment, unified input/output forms, and dataset bias. To address these challenges, we propose a Task-Specific Prompt method to jointly model subtasks and introduce a multimodal generative framework called UniSA. Additionally, we organize the benchmark datasets of main subtasks into a new Sentiment Analysis Evaluation benchmark, SAEval. We design novel pre-training tasks and training methods to enable the model to learn generic sentiment knowledge among subtasks to improve the model's multimodal sentiment perception ability. Our experimental results show that UniSA performs comparably to the state-of-the-art on all subtasks and generalizes well to various subtasks in sentiment analysis.

LGJan 22, 2022
Iterative Activation-based Structured Pruning

Kaiqi Zhao, Animesh Jain, Ming Zhao

Deploying complex deep learning models on edge devices is challenging because they have substantial compute and memory resource requirements, whereas edge devices' resource budgets are limited. To solve this problem, extensive pruning techniques have been proposed for compressing networks. Recent advances based on the Lottery Ticket Hypothesis (LTH) show that iterative model pruning tends to produce smaller and more accurate models. However, LTH research focuses on unstructured pruning, which is hardware-inefficient and difficult to accelerate on hardware platforms. In this paper, we investigate iterative pruning in the context of structured pruning because structurally pruned models map well on commodity hardware. We find that directly applying a structured weight-based pruning technique iteratively, called iterative L1-norm based pruning (ILP), does not produce accurate pruned models. To solve this problem, we propose two activation-based pruning methods, Iterative Activation-based Pruning (IAP) and Adaptive Iterative Activation-based Pruning (AIAP). We observe that, with only 1% accuracy loss, IAP and AIAP achieve 7.75X and 15.88X compression on LeNet-5, and 1.25X and 1.71X compression on ResNet-50, whereas ILP achieves 4.77X and 1.13X, respectively.
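The weight-based baseline named in the abstract, L1-norm structured pruning, ranks whole filters by the sum of their absolute weights and removes the lowest-ranked ones. The sketch below shows one such pruning step under simplifying assumptions: filters are flat lists of weights, the function names are hypothetical, and the retraining that an iterative scheme would run between steps is omitted. The activation-based methods proposed in the paper (IAP, AIAP) are not shown.

```python
def l1_norms(filters):
    """L1 norm (sum of absolute weights) of each filter."""
    return [sum(abs(w) for w in f) for f in filters]


def prune_lowest_l1(filters, fraction):
    """Remove the `fraction` of filters with the smallest L1 norm (structured pruning)."""
    n_prune = int(len(filters) * fraction)
    norms = l1_norms(filters)
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    # Keep surviving filters in their original order.
    return [f for i, f in enumerate(filters) if i not in pruned]


# Four toy filters, each flattened to two weights.
filters = [[0.1, -0.1], [1.0, 2.0], [0.5, -0.5], [0.01, 0.02]]
kept = prune_lowest_l1(filters, 0.5)
compression = len(filters) / len(kept)   # ratio of original to remaining filters
```

Because entire filters are removed, the pruned layer is simply a smaller dense layer, which is why structured pruning maps well onto commodity hardware.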

LGJan 22, 2022
Enabling Deep Learning on Edge Devices through Filter Pruning and Knowledge Transfer

Kaiqi Zhao, Yitao Chen, Ming Zhao

Deep learning models have introduced various intelligent applications to edge devices, such as image classification, speech recognition, and augmented reality. There is an increasing need to train such models on the devices in order to deliver personalized, responsive, and private learning. To address this need, this paper presents a new solution for deploying and training state-of-the-art models on resource-constrained devices. First, the paper proposes a novel filter-pruning-based model compression method to create lightweight trainable models from large models trained in the cloud, without much loss of accuracy. Second, it proposes a novel knowledge transfer method to enable the on-device model to update incrementally in real time or near real time using incremental learning on new data and enable the on-device model to learn the unseen categories with the help of the in-cloud model in an unsupervised fashion. The results show that 1) our model compression method can remove up to 99.36% of the parameters of WRN-28-10, while preserving a Top-1 accuracy of over 90% on CIFAR-10; 2) our knowledge transfer method enables the compressed models to achieve more than 90% accuracy on CIFAR-10 and retain good accuracy on old categories; 3) it allows the compressed models to converge within real time (three to six minutes) on the edge for incremental learning tasks; 4) it enables the model to classify unseen categories of data (78.92% Top-1 accuracy) that it was never trained on.

CVJan 21, 2022
Adaptive Activation-based Structured Pruning

Kaiqi Zhao, Animesh Jain, Ming Zhao

Pruning is a promising approach to compress complex deep learning models in order to deploy them on resource-constrained edge devices. However, many existing pruning solutions are based on unstructured pruning, which yields models that cannot efficiently run on commodity hardware and require users to manually explore and tune the pruning process, which is time-consuming and often leads to sub-optimal results. To address these limitations, this paper presents an adaptive, activation-based, structured pruning approach to automatically and efficiently generate small, accurate, and hardware-efficient models that meet user requirements. First, it proposes iterative structured pruning using activation-based attention feature maps to effectively identify and prune unimportant filters. Then, it proposes adaptive pruning policies for automatically meeting the pruning objectives of accuracy-critical, memory-constrained, and latency-sensitive tasks. A comprehensive evaluation shows that the proposed method can substantially outperform the state-of-the-art structured pruning works on CIFAR-10 and ImageNet datasets. For example, on ResNet-56 with CIFAR-10, without any accuracy drop, our method achieves the largest parameter reduction (79.11%), outperforming the related works by 22.81% to 66.07%, and the largest FLOPs reduction (70.13%), outperforming the related works by 14.13% to 26.53%.

IR · Oct 22, 2021
MIC: Model-agnostic Integrated Cross-channel Recommenders

Yujie Lu, Ping Nie, Shengyu Zhang et al.

Semantically connecting users and items is a fundamental problem for the matching stage of an industrial recommender system. Recent advances in this topic are based on multi-channel retrieval to efficiently measure users' interest in items from the massive candidate pool. However, existing works are primarily built upon pre-defined retrieval channels, including User-CF (U2U), Item-CF (I2I), and Embedding-based Retrieval (U2I), and thus capture only the limited correlation between users and items that follows from partial information about latent interactions. In this paper, we propose a model-agnostic integrated cross-channel (MIC) approach for large-scale recommendation, which maximally leverages the inherent multi-channel mutual information to enhance the matching performance. Specifically, MIC robustly models correlation within user-item, user-user, and item-item from latent interactions in a universal schema. For each channel, MIC naturally aligns pairs with semantic similarity and distinguishes them otherwise with a more uniform anisotropic representation space. While state-of-the-art methods require specific architectural design, MIC intuitively considers them as a whole by enabling the complete information flow among users and items. Thus MIC can be easily plugged into other retrieval recommender systems. Extensive experiments show that our MIC helps several state-of-the-art models boost their performance on two real-world benchmarks. The satisfactory deployment of the proposed MIC on industrial online services empirically proves its scalability and flexibility.

CL · Jun 16, 2021
SEOVER: Sentence-level Emotion Orientation Vector based Conversation Emotion Recognition Model

Zaijing Li, Fengxiao Tang, Tieyu Sun et al.

For the task of conversation emotion recognition, recent works focus on speaker relationship modeling but ignore the role of an utterance's emotional tendency. In this paper, we propose a new expression paradigm of sentence-level emotion orientation vector to model the potential correlation of emotions between sentence vectors. Based on it, we design an emotion recognition model, which extracts the sentence-level emotion orientation vectors from the language model and jointly learns from the dialogue sentiment analysis model and extracted sentence-level emotion orientation vectors to identify the speaker's emotional orientation during the conversation. We conduct experiments on two benchmark datasets and compare our model with five baseline models. The experimental results show that our model has better performance on all datasets.

AI · Jan 26, 2021
On the Evaluation of Vision-and-Language Navigation Instructions

Ming Zhao, Peter Anderson, Vihan Jain et al.

Vision-and-Language Navigation wayfinding agents can be enhanced by exploiting automatically generated navigation instructions. However, existing instruction generators have not been comprehensively evaluated, and the automatic evaluation metrics used to develop them have not been validated. Using human wayfinders, we show that these generators perform on par with or only slightly better than a template-based generator and far worse than human instructors. Furthermore, we discover that BLEU, ROUGE, METEOR and CIDEr are ineffective for evaluating grounded navigation instructions. To improve instruction evaluation, we propose an instruction-trajectory compatibility model that operates without reference instructions. Our model shows the highest correlation with human wayfinding outcomes when scoring individual instructions. For ranking instruction generation systems, if reference instructions are available we recommend using SPICE.

CV · Nov 18, 2020
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

Bowen Zhang, Hexiang Hu, Joonseok Lee et al.

Identifying a short segment in a long video that semantically matches a text query is a challenging task with important potential applications in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but it is challenging to localize undefined segments in untrimmed and unsegmented videos where exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal localization, and masked language modeling. We conduct extensive experiments to evaluate our model on moment localization in video corpus on the ActivityNet Captions and TVR datasets. Our approach outperforms the previous methods as well as strong baselines, establishing a new state of the art for this task.

LG · Mar 28, 2020
Semi-Federated Learning

Zhikun Chen, Daofeng Li, Ming Zhao et al.

Federated learning (FL) enables massive distributed Information and Communication Technology (ICT) devices to learn a global consensus model without any participants revealing their own data to the central server. However, the practicality, communication expense, and non-independent and identically distributed (Non-IID) data challenges in FL still need to be addressed. In this work, we propose Semi-Federated Learning (Semi-FL), which differs from FL in two aspects: local client clustering and in-cluster training. A sequential training scheme is designed for in-cluster training, which enables neighboring clients to share their learning models. The proposed Semi-FL can be easily applied to future mobile communication networks and requires less uplink transmission bandwidth. Numerical experiments validate the feasibility, learning performance, and robustness to Non-IID data of the proposed Semi-FL. Semi-FL extends the existing potential of FL.
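The in-cluster sequential training idea can be sketched as follows, using linear regression as a stand-in model: within each cluster, a model is passed from client to client and trained locally at each stop, and the server then combines one model per cluster. The cluster assignment is taken as given, and the sample-weighted averaging at the server is an assumption borrowed from standard federated averaging; the abstract does not specify the clustering method or the aggregation rule. All function names are hypothetical.

```python
import numpy as np


def local_sgd(w, X, y, lr=0.1, steps=20):
    """Run a few gradient steps of least-squares regression on one client."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def semi_fl_round(w_global, clusters, lr=0.1, steps=20):
    """One communication round.

    clusters is a list of clusters; each cluster is a list of (X, y)
    client datasets. Only one model per cluster is sent to the server.
    """
    cluster_models, cluster_sizes = [], []
    for cluster in clusters:
        w = w_global.copy()
        n_samples = 0
        # Sequential in-cluster training: each client continues from the
        # model handed over by the previous client in the same cluster.
        for X, y in cluster:
            w = local_sgd(w, X, y, lr, steps)
            n_samples += len(y)
        cluster_models.append(w)
        cluster_sizes.append(n_samples)
    # Assumed aggregation: average cluster models weighted by sample count.
    return np.average(np.stack(cluster_models), axis=0, weights=cluster_sizes)
```

Because the server receives one upload per cluster rather than one per client, the number of uplink transmissions per round drops with the cluster size, which matches the bandwidth argument made in the abstract.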

SP · Feb 13, 2020
A comprehensive review on convolutional neural network in machine fault diagnosis

Jinyang Jiao, Ming Zhao, Jing Lin et al.

With the rapid development of the manufacturing industry, machine fault diagnosis has become increasingly significant to ensure safe equipment operation and production. Consequently, multifarious approaches have been explored and developed in the past years, of which intelligent algorithms have developed particularly rapidly. The convolutional neural network, as a typical representative of intelligent diagnostic models, has been extensively studied and applied in the past five years, and a large amount of literature has been published in academic journals and conference proceedings. However, there has not been a systematic review that covers these studies and offers an outlook for further research. To fill this gap, this work attempts to review and summarize the development of Convolutional Network based Fault Diagnosis (CNFD) approaches comprehensively. Generally, a typical CNFD framework is composed of the following steps: data collection, model construction, and feature learning and decision making; this paper is organized accordingly. Firstly, the data collection process is described, and several popular datasets are introduced. Then, the fundamental theory, from the basic convolutional neural network to its variants, is elaborated. After that, the applications of CNFD are reviewed in terms of three mainstream directions, i.e., classification, prediction, and transfer diagnosis. Finally, conclusions and prospects are presented to point out the characteristics of current development, the challenges faced, and future trends. It is expected that this work will provide convenience and inspire further exploration for researchers in this field.

RO · Dec 9, 2017
ESD CYCLOPS: A new robotic surgical system for GI surgery

Timo J. C. Oude Vrielink, Ming Zhao, Ara Darzi et al.

Gastrointestinal (GI) cancers account for 1.5 million deaths worldwide. Endoscopic Submucosal Dissection (ESD) is an advanced therapeutic endoscopy technique with superior clinical outcomes due to the minimally invasive and en bloc removal of tumours. In the western world, ESD is seldom carried out, due to its complex and challenging nature. Various surgical systems are being developed to make this therapy accessible; however, these solutions have shown limited operational workspace, dexterity, or force exertion capabilities. The current paper presents the ESD CYCLOPS system, a bimanual surgical robotic attachment that can be mounted at the end of any flexible endoscope. The system is able to achieve forces of up to 46 N and showed a mean error of 0.217 mm during an elliptical tracing task. The workspace and instrument dexterity are shown by pre-clinical ex vivo trials, in which ESD is successfully performed by a GI surgeon. The system is currently undergoing pre-clinical in vivo validation.