h-index97
478papers
15,033citations
Novelty52%
AI Score63

478 Papers

ASNov 21, 2022
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Qiushi Zhu, Long Zhou, Ziqiang Zhang et al. · microsoft-research

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning was not well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR), visual speech recognition (VSR) tasks. Results show that the proposed VATLM outperforms previous the state-of-the-art models, such as audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.

LGJun 29, 2023Code
RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark

Federico Berto, Chuanbo Hua, Junyoung Park et al. · pku

Combinatorial optimization (CO) is fundamental to several real-world applications, from logistics and scheduling to hardware design and resource allocation. Deep reinforcement learning (RL) has recently shown significant benefits in solving CO problems, reducing reliance on domain expertise and improving computational efficiency. However, the absence of a unified benchmarking framework leads to inconsistent evaluations, limits reproducibility, and increases engineering overhead, raising barriers to adoption for new researchers. To address these challenges, we introduce RL4CO, a unified and extensive benchmark with in-depth library coverage of 27 CO problem environments and 23 state-of-the-art baselines. Built on efficient software libraries and best practices in implementation, RL4CO features modularized implementation and flexible configurations of diverse environments, policy architectures, RL algorithms, and utilities with extensive documentation. RL4CO helps researchers build on existing successes while exploring and developing their own designs, facilitating the entire research process by decoupling science from heavy engineering. We finally provide extensive benchmark studies to inspire new insights and future work. RL4CO has already attracted numerous researchers in the community and is open-sourced at https://github.com/ai4co/rl4co.

LGDec 12, 2022
Accelerating Dataset Distillation via Model Augmentation

Lei Zhang, Jie Zhang, Bowen Lei et al. · microsoft-research

Dataset Distillation (DD), a newly emerging field, aims at generating much smaller but efficient synthetic training datasets from large ones. Existing DD methods based on gradient matching achieve leading performance; however, they are extremely computationally intensive as they require continuously optimizing a dataset among thousands of randomly initialized models. In this paper, we assume that training the synthetic data with diverse models leads to better generalization performance. Thus we propose two model augmentation techniques, i.e. using early-stage models and parameter perturbation to learn an informative synthetic set with significantly reduced training cost. Extensive experiments demonstrate that our method achieves up to 20x speedup and comparable performance on par with state-of-the-art methods.

LGAug 14, 2024Code
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

Enneng Yang, Li Shen, Guibing Guo et al.

Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and more than ten machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications.

CVApr 9, 2023Code
CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

Zhongqi Wang, Jie Zhang, Zhilong Ji et al.

With the development of deep generative models, recent years have seen great success of Chinese landscape painting generation. However, few works focus on controllable Chinese landscape painting generation due to the lack of data and limited modeling capabilities. In this work, we propose a controllable Chinese landscape painting generation method named CCLAP, which can generate painting with specific content and style based on Latent Diffusion Model. Specifically, it consists of two cascaded modules, i.e., content generator and style aggregator. The content generator module guarantees the content of generated paintings specific to the input text. While the style aggregator module is to generate paintings of a style corresponding to a reference image. Moreover, a new dataset of Chinese landscape paintings named CLAP is collected for comprehensive evaluation. Both the qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance, especially in artfully-composed and artistic conception. Codes are available at https://github.com/Robin-WZQ/CCLAP.

AIMar 4, 2023Code
Neural Airport Ground Handling

Yaoxin Wu, Jianan Zhou, Yunwen Xia et al.

Airport ground handling (AGH) offers necessary operations to flights during their turnarounds and is of great importance to the efficiency of airport management and the economics of aviation. Such a problem involves the interplay among the operations that leads to NP-hard problems with complex constraints. Hence, existing methods for AGH are usually designed with massive domain knowledge but still fail to yield high-quality solutions efficiently. In this paper, we aim to enhance the solution quality and computation efficiency for solving AGH. Particularly, we first model AGH as a multiple-fleet vehicle routing problem (VRP) with miscellaneous constraints including precedence, time windows, and capacity. Then we propose a construction framework that decomposes AGH into sub-problems (i.e., VRPs) in fleets and present a neural method to construct the routing solutions to these sub-problems. In specific, we resort to deep learning and parameterize the construction heuristic policy with an attention-based neural network trained with reinforcement learning, which is shared across all sub-problems. Extensive experiments demonstrate that our method significantly outperforms classic meta-heuristics, construction heuristics and the specialized methods for AGH. Besides, we empirically verify that our neural method generalizes well to instances with large numbers of flights or varying parameters, and can be readily adapted to solve real-time AGH with stochastic flight arrivals. Our code is publicly available at: https://github.com/RoyalSkye/AGH.

AIMay 31Code
FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

Hongxu Ma, Han Zhou, Chenghou Jin et al.

Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new (or the fourth) paradigm -- Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages NFs to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP Library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over SOTA methods.

CVFeb 20, 2023Code
Pseudo Label-Guided Model Inversion Attack via Conditional Generative Adversarial Network

Xiaojian Yuan, Kejiang Chen, Jie Zhang et al.

Model inversion (MI) attacks have raised increasing concerns about privacy, which can reconstruct training data from public models. Indeed, MI attacks can be formalized as an optimization problem that seeks private data in a certain space. Recent MI attacks leverage a generative adversarial network (GAN) as an image prior to narrow the search space, and can successfully reconstruct even the high-dimensional data (e.g., face images). However, these generative MI attacks do not fully exploit the potential capabilities of the target model, still leading to a vague and coupled search space, i.e., different classes of images are coupled in the search space. Besides, the widely used cross-entropy loss in these attacks suffers from gradient vanishing. To address these problems, we propose Pseudo Label-Guided MI (PLG-MI) attack via conditional GAN (cGAN). At first, a top-n selection strategy is proposed to provide pseudo-labels for public data, and use pseudo-labels to guide the training of the cGAN. In this way, the search space is decoupled for different classes of images. Then a max-margin loss is introduced to improve the search process on the subspace of a target class. Extensive experiments demonstrate that our PLG-MI attack significantly improves the attack success rate and visual quality for various datasets and models, notably, 2~3 $\times$ better than state-of-the-art attacks under large distributional shifts. Our code is available at: https://github.com/LetheSec/PLG-MI-Attack.

CRAug 21, 2023Code
Backdooring Textual Inversion for Concept Censorship

Yutong Wu, Jie Zhang, Florian Kerschbaum et al.

Recent years have witnessed success in AIGC (AI Generated Content). People can make use of a pre-trained diffusion model to generate images of high quality or freely modify existing pictures with only prompts in nature language. More excitingly, the emerging personalization techniques make it feasible to create specific-desired images with only a few images as references. However, this induces severe threats if such advanced techniques are misused by malicious users, such as spreading fake news or defaming individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevailing for its lightweight nature and excellent performance. TI crafts the word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own stable diffusion model without fine-tuning for personalization. To achieve the concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored for normal use. In the subsequent generation stage, if the triggers are combined with personalized embeddings as final prompts, the model will output a pre-defined target image rather than images including the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io.

LGSep 1, 2022
Federated Learning with Label Distribution Skew via Logits Calibration

Jie Zhang, Zhiqi Li, Bo Li et al.

Traditional federated optimization methods perform poorly with heterogeneous data (ie, accuracy reduction), especially for highly skewed data. In this paper, we investigate the label distribution skew in FL, where the distribution of labels varies across clients. First, we investigate the label distribution skew from a statistical view. We demonstrate both theoretically and empirically that previous methods based on softmax cross-entropy are not suitable, which can result in local models heavily overfitting to minority classes and missing classes. Additionally, we theoretically introduce a deviation bound to measure the deviation of the gradient after local update. At last, we propose FedLC (\textbf {Fed} erated learning via\textbf {L} ogits\textbf {C} alibration), which calibrates the logits before softmax cross-entropy according to the probability of occurrence of each class. FedLC applies a fine-grained calibrated cross-entropy loss to local update by adding a pairwise label margin. Extensive experiments on federated datasets and real-world datasets demonstrate that FedLC leads to a more accurate global model and much improved performance. Furthermore, integrating other FL methods into our approach can further enhance the performance of the global model.

CVMay 27
Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

Niantong Li, Guangzheng Hu, Weixu Qiao et al.

Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users' pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.

CVJul 5, 2024Code
T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Zhongqi Wang, Jie Zhang, Shiguang Shan et al.

While text-to-image diffusion models demonstrate impressive generation capabilities, they also exhibit vulnerability to backdoor attacks, which involve the manipulation of model outputs through malicious triggers. In this paper, for the first time, we propose a comprehensive defense method named T2IShield to detect, localize, and mitigate such attacks. Specifically, we find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger. Based on this key insight, we propose two effective backdoor detection methods: Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis. Besides, we introduce a binary-search approach to localize the trigger within a backdoor sample and assess the efficacy of existing concept editing methods in mitigating backdoor attacks. Empirical evaluations on two advanced backdoor attack scenarios show the effectiveness of our proposed defense method. For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9$\%$ with low computational cost. Furthermore, T2IShield achieves a localization F1 score of 86.4$\%$ and invalidates 99$\%$ poisoned samples. Codes are released at https://github.com/Robin-WZQ/T2IShield.

CVJun 3
Qwen-Image-Flash: Beyond Objective Design

Tianhe Wu, Kun Yan, Zikai Zhou et al.

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.

OSJun 3
GNStor: Design of GPU-Native High-Performance Remote All-Flash Array

Shushu Yi, Wenbo Wu, Guoci Chen et al.

GPU has become the leading computing device for a wide range of data-intensive applications, which tightly collaborates with remote all-flash array (AFA) to accommodate ever-expanding datasets, facilitate multi-client data sharing, and guarantee fault tolerance. Although GPU is the center of computation, all I/O processes in existing GPU-AFA systems are still CPU-centric. CPU orchestrates remote I/O requests and executes a centralized AFA engine to take charge of AFA-level functionalities (e.g., access control and metadata persistence). This design disparity suffers from substantial CPU-GPU interaction overhead and I/O traffic amplification, compromising end-to-end I/O performance. In this work, we present \emph{GNStor}, a GPU-native AFA system that enables GPU to directly access remote AFA without CPU intervention in the I/O path, thereby fully exploiting the performance of AFA. Specifically, GNStor first proposes a GPU-centric NVMe over RDMA (NoR) software stack (named \emph{GNoR}), paving a fast path for GPUs to directly initiate NoR I/O requests to SSDs within remote AFA. GNoR employs an atomic-operation-based I/O orchestration design and follows the single-instruction-multiple-thread (SIMT) execution model of GPU, fully exploiting the massive parallelism of GPU architectures. To facilitate essential AFA functionalities in a CPU-bypass I/O path, GNStor further designs \emph{deEngine}, a decentralized AFA engine that seamlessly decomposes and integrates AFA-level tasks into each SSD firmware, thereby achieving efficient AFA access at low cost. Evaluation results show that GNStor achieves 3.2$\times$ higher I/O throughput and reduces application execution time by 31.1\%, compared to state-of-the-art AFA systems.

LGMar 13, 2023
TARGET: Federated Class-Continual Learning via Exemplar-Free Distillation

Jie Zhang, Chen Chen, Weiming Zhuang et al.

This paper focuses on an under-explored yet important problem: Federated Class-Continual Learning (FCCL), where new classes are dynamically added in federated learning. Existing FCCL works suffer from various limitations, such as requiring additional datasets or storing the private data from previous tasks. In response, we first demonstrate that non-IID data exacerbates catastrophic forgetting issue in FL. Then we propose a novel method called TARGET (federat\textbf{T}ed cl\textbf{A}ss-continual lea\textbf{R}nin\textbf{G} via \textbf{E}xemplar-free dis\textbf{T}illation), which alleviates catastrophic forgetting in FCCL while preserving client data privacy. Our proposed method leverages the previously trained global model to transfer knowledge of old tasks to the current task at the model level. Moreover, a generator is trained to produce synthetic data to simulate the global distribution of data on each client at the data level. Compared to previous FCCL methods, TARGET does not require any additional datasets or storing real data from previous tasks, which makes it ideal for data-sensitive scenarios.

CLAug 22, 2024Code
GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models

Kunsheng Tang, Wenbo Zhou, Jie Zhang et al.

Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but they have also been observed to magnify societal biases, particularly those related to gender. In response to this issue, several benchmarks have been proposed to assess gender bias in LLMs. However, these benchmarks often lack practical flexibility or inadvertently introduce biases. To address these shortcomings, we introduce GenderCARE, a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics for quantifying and mitigating gender bias in LLMs. To begin, we establish pioneering criteria for gender equality benchmarks, spanning dimensions such as inclusivity, diversity, explainability, objectivity, robustness, and realisticity. Guided by these criteria, we construct GenderPair, a novel pair-based benchmark designed to assess gender bias in LLMs comprehensively. Our benchmark provides standardized and realistic evaluations, including previously overlooked gender groups such as transgender and non-binary individuals. Furthermore, we develop effective debiasing techniques that incorporate counterfactual data augmentation and specialized fine-tuning strategies to reduce gender bias in LLMs without compromising their overall performance. Extensive experiments demonstrate a significant reduction in various gender bias benchmarks, with reductions peaking at over 90% and averaging above 35% across 17 different LLMs. Importantly, these reductions come with minimal variability in mainstream language tasks, remaining below 2%. By offering a realistic assessment and tailored reduction of gender biases, we hope that our GenderCARE can represent a significant step towards achieving fairness and equity in LLMs. More details are available at https://github.com/kstanghere/GenderCARE-ccs24.

CVAug 19, 2022
Hierarchical Compositional Representations for Few-shot Action Recognition

Changzhen Li, Jie Zhang, Shuzhe Wu et al.

Recently action recognition has received more and more attention for its comprehensive and practical applications in intelligent surveillance and human-computer interaction. However, few-shot action recognition has not been well explored and remains challenging because of data scarcity. In this paper, we propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition. Specifically, we divide a complicated action into several sub-actions by carefully designed hierarchical clustering and further decompose the sub-actions into more fine-grained spatially attentional sub-actions (SAS-actions). Although there exist large differences between base classes and novel classes, they can share similar patterns in sub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance in the transportation problem to measure the similarity between video samples in terms of sub-action representations. It computes the optimal matching flows between sub-actions as distance metric, which is favorable for comparing fine-grained patterns. Extensive experiments show our method achieves the state-of-the-art results on HMDB51, UCF101 and Kinetics datasets.

LGFeb 19, 2023
Delving into the Adversarial Robustness of Federated Learning

Jie Zhang, Bo Li, Chen Chen et al.

In Federated Learning (FL), models are as fragile as centrally trained models against adversarial examples. However, the adversarial robustness of federated learning remains largely unexplored. This paper casts light on the challenge of adversarial robustness of federated learning. To facilitate a better understanding of the adversarial vulnerability of the existing FL methods, we conduct comprehensive robustness evaluations on various attacks and adversarial training methods. Moreover, we reveal the negative impacts induced by directly adopting adversarial training in FL, which seriously hurts the test accuracy, especially in non-IID settings. In this work, we propose a novel algorithm called Decision Boundary based Federated Adversarial Training (DBFAT), which consists of two components (local re-weighting and global regularization) to improve both accuracy and robustness of FL systems. Extensive experiments on multiple datasets demonstrate that DBFAT consistently outperforms other baselines under both IID and non-IID settings.

IRJun 22, 2022
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation

Zhu Sun, Hui Fang, Jie Yang et al.

Recently, one critical issue looms large in the field of recommender systems -- there are no effective benchmarks for rigorous evaluation -- which consequently leads to unreproducible evaluation and unfair comparison. We, therefore, conduct studies from the perspectives of practical theory and experiments, aiming at benchmarking recommendation for rigorous evaluation. Regarding the theoretical study, a series of hyper-factors affecting recommendation performance throughout the whole evaluation chain are systematically summarized and analyzed via an exhaustive review on 141 papers published at eight top-tier conferences within 2017-2020. We then classify them into model-independent and model-dependent hyper-factors, and different modes of rigorous evaluation are defined and discussed in-depth accordingly. For the experimental study, we release DaisyRec 2.0 library by integrating these hyper-factors to perform rigorous evaluation, whereby a holistic empirical study is conducted to unveil the impacts of different hyper-factors on recommendation performance. Supported by the theoretical and experimental studies, we finally create benchmarks for rigorous evaluation by proposing standardized procedures and providing performance of ten state-of-the-arts across six evaluation metrics on six datasets as a reference for later study. Overall, our work sheds light on the issues in recommendation evaluation, provides potential solutions for rigorous evaluation, and lays foundation for further investigation.

LGNov 15, 2022
FedTune: A Deep Dive into Efficient Federated Fine-Tuning with Pre-trained Transformers

Jinyu Chen, Wenchao Xu, Song Guo et al.

Federated Learning (FL) is an emerging paradigm that enables distributed users to collaboratively and iteratively train machine learning models without sharing their private data. Motivated by the effectiveness and robustness of self-attention-based architectures, researchers are turning to using pre-trained Transformers (i.e., foundation models) instead of traditional convolutional neural networks in FL to leverage their excellent transfer learning capabilities. Despite recent progress, how pre-trained Transformer models play a role in FL remains obscure, that is, how to efficiently fine-tune these pre-trained models in FL and how FL users could benefit from this new paradigm. In this paper, we explore this issue and demonstrate that the fine-tuned Transformers achieve extraordinary performance on FL, and that the lightweight fine-tuning method facilitates a fast convergence rate and low communication costs. Concretely, we conduct a rigorous empirical study of three tuning methods (i.e., modifying the input, adding extra modules, and adjusting the backbone) using two types of pre-trained models (i.e., vision-language models and vision models) for FL. Our experiments show that 1) Fine-tuning the bias term of the backbone performs best when relying on a strong pre-trained model; 2) The vision-language model (e.g., CLIP) outperforms the pure vision model (e.g., ViT) and is more robust to the few-shot settings; 3) Compared to pure local training, FL with pre-trained models has a higher accuracy because it alleviates the problem of over-fitting. We will release our code and encourage further exploration of pre-trained Transformers and FL.

LGSep 13, 2022
Learning to Solve Multiple-TSP with Time Window and Rejections via Deep Reinforcement Learning

Rongkai Zhang, Cong Zhang, Zhiguang Cao et al.

We propose a manager-worker framework based on deep reinforcement learning to tackle a hard yet nontrivial variant of Travelling Salesman Problem (TSP), \ie~multiple-vehicle TSP with time window and rejections (mTSPTWR), where customers who cannot be served before the deadline are subject to rejections. Particularly, in the proposed framework, a manager agent learns to divide mTSPTWR into sub-routing tasks by assigning customers to each vehicle via a Graph Isomorphism Network (GIN) based policy network. A worker agent learns to solve sub-routing tasks by minimizing the cost in terms of both tour length and rejection rate for each vehicle, the maximum of which is then fed back to the manager agent to learn better assignments. Experimental results demonstrate that the proposed framework outperforms strong baselines in terms of higher solution quality and shorter computation time. More importantly, the trained agents also achieve competitive performance for solving unseen larger instances.

AIFeb 27, 2023
Learning Large Neighborhood Search for Vehicle Routing in Airport Ground Handling

Jianan Zhou, Yaoxin Wu, Zhiguang Cao et al.

Dispatching vehicle fleets to serve flights is a key task in airport ground handling (AGH). Due to the notable growth of flights, it is challenging to simultaneously schedule multiple types of operations (services) for a large number of flights, where each type of operation is performed by one specific vehicle fleet. To tackle this issue, we first represent the operation scheduling as a complex vehicle routing problem and formulate it as a mixed integer linear programming (MILP) model. Then given the graph representation of the MILP model, we propose a learning assisted large neighborhood search (LNS) method using data generated based on real scenarios, where we integrate imitation learning and graph convolutional network (GCN) to learn a destroy operator to automatically select variables, and employ an off-the-shelf solver as the repair operator to reoptimize the selected variables. Experimental results based on a real airport show that the proposed method allows for handling up to 200 flights with 10 types of operations simultaneously, and outperforms state-of-the-art methods. Moreover, the learned method performs consistently accompanying different solvers, and generalizes well on larger instances, verifying the versatility and scalability of our method.

CVMar 15, 2023
Real Face Foundation Representation Learning for Generalized Deepfake Detection

Liang Shi, Jie Zhang, Shiguang Shan

The emergence of deepfake technologies has become a matter of social concern as they pose threats to individual privacy and public security. It is now of great significance to develop reliable deepfake detectors. However, with numerous face manipulation algorithms present, it is almost impossible to collect sufficient representative fake faces, and it is hard for existing detectors to generalize to all types of manipulation. Therefore, we turn to learn the distribution of real faces, and indirectly identify fake images that deviate from the real face distribution. In this study, we propose Real Face Foundation Representation Learning (RFFR), which aims to learn a general representation from large-scale real face datasets and detect potential artifacts outside the distribution of RFFR. Specifically, we train a model on real face datasets by masked image modeling (MIM), which results in a discrepancy between input faces and the reconstructed ones when applying the model on fake samples. This discrepancy reveals the low-level artifacts not contained in RFFR, making it easier to build a deepfake detector sensitive to all kinds of potential artifacts outside the distribution of RFFR. Extensive experiments demonstrate that our method brings about better generalization performance, as it significantly outperforms the state-of-the-art methods in cross-manipulation evaluations, and has the potential to further improve by introducing extra real faces for training RFFR.

CVJul 6, 2023
Contextual Affinity Distillation for Image Anomaly Detection

Jie Zhang, Masanori Suganuma, Takayuki Okatani

Previous works on unsupervised industrial anomaly detection mainly focus on local structural anomalies such as cracks and color contamination. While achieving significantly high detection performance on this kind of anomaly, they are faced with logical anomalies that violate the long-range dependencies such as a normal object placed in the wrong position. In this paper, based on previous knowledge distillation works, we propose to use two students (local and global) to better mimic the teacher's behavior. The local student, which is used in previous studies mainly focuses on structural anomaly detection while the global student pays attention to logical anomalies. To further encourage the global student's learning to capture long-range dependencies, we design the global context condensing block (GCCB) and propose a contextual affinity loss for the student training and anomaly scoring. Experimental results show the proposed method doesn't need cumbersome training techniques and achieves a new state-of-the-art performance on the MVTec LOCO AD dataset.

CVSep 27, 2023Code
Warfare:Breaking the Watermark Protection of AI-Generated Content

Guanlin Li, Yifei Chen, Jie Zhang et al.

AI-Generated Content (AIGC) is rapidly expanding, with services using advanced generative models to create realistic images and fluent text. Regulating such content is crucial to prevent policy violations, such as unauthorized commercialization or unsafe content distribution. Watermarking is a promising solution for content attribution and verification, but we demonstrate its vulnerability to two key attacks: (1) Watermark removal, where adversaries erase embedded marks to evade regulation, and (2) Watermark forging, where they generate illicit content with forged watermarks, leading to misattribution. We propose Warfare, a unified attack framework leveraging a pre-trained diffusion model for content processing and a generative adversarial network for watermark manipulation. Evaluations across datasets and embedding setups show that Warfare achieves high success rates while preserving content quality. We further introduce Warfare-Plus, which enhances efficiency without compromising effectiveness. The code can be found in https://github.com/GuanlinLee/warfare.

CVAug 7, 2024Code
FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks

Ruiqi Wang, Jinyang Huang, Jie Zhang et al.

Depression is a prevalent mental health disorder that significantly impacts individuals' lives and well-being. Early detection and intervention are crucial for effective treatment and management of depression. Recently, there are many end-to-end deep learning methods leveraging the facial expression features for automatic depression detection. However, most current methods overlook the temporal dynamics of facial expressions. Although very recent 3DCNN methods remedy this gap, they introduce more computational cost due to the selection of CNN-based backbones and redundant facial features. To address the above limitations, by considering the timing correlation of facial expressions, we propose a novel framework called FacialPulse, which recognizes depression with high accuracy and speed. By harnessing the bidirectional nature and proficiently addressing long-term dependencies, the Facial Motion Modeling Module (FMMM) is designed in FacialPulse to fully capture temporal features. Since the proposed FMMM has parallel processing capabilities and has the gate mechanism to mitigate gradient vanishing, this module can also significantly boost the training speed. Besides, to effectively use facial landmarks to replace original images to decrease information redundancy, a Facial Landmark Calibration Module (FLCM) is designed to eliminate facial landmark errors to further improve recognition accuracy. Extensive experiments on the AVEC2014 dataset and MMDA dataset (a depression dataset) demonstrate the superiority of FacialPulse on recognition accuracy and speed, with the average MAE (Mean Absolute Error) decreased by 21% compared to baselines, and the recognition speed increased by 100% compared to state-of-the-art methods. Codes are released at https://github.com/volatileee/FacialPulse.

LGJun 28, 2023
Federated Generative Learning with Foundation Models

Jie Zhang, Xiaohua Qi, Bo Zhao

Existing approaches in Federated Learning (FL) mainly focus on sending model parameters or gradients from clients to a server. However, these methods are plagued by significant inefficiency, privacy, and security concerns. Thanks to the emerging foundation generative models, we propose a novel federated learning framework, namely Federated Generative Learning. In this framework, each client can create text embeddings that are tailored to their local data, and send embeddings to the server. Then the informative training data can be synthesized remotely on the server using foundation generative models with these embeddings, which can benefit FL tasks. Our proposed framework offers several advantages, including increased communication efficiency, robustness to data heterogeneity, substantial performance improvements, and enhanced privacy protection. We validate these benefits through extensive experiments conducted on 12 datasets. For example, on the ImageNet100 dataset with a highly skewed data distribution, our method outperforms FedAvg by 12% in a single communication round, compared to FedAvg's performance over 200 communication rounds. We have released the code for all experiments conducted in this study.

CVNov 29, 2022
Ada3Diff: Defending against 3D Adversarial Point Clouds via Adaptive Diffusion

Kui Zhang, Hang Zhou, Jie Zhang et al.

Deep 3D point cloud models are sensitive to adversarial attacks, which poses threats to safety-critical applications such as autonomous driving. Robust training and defend-by-denoising are typical strategies for defending adversarial perturbations. However, they either induce massive computational overhead or rely heavily upon specified priors, limiting generalized robustness against attacks of all kinds. To remedy it, this paper introduces a novel distortion-aware defense framework that can rebuild the pristine data distribution with a tailored intensity estimator and a diffusion model. To perform distortion-aware forward diffusion, we design a distortion estimation algorithm that is obtained by summing the distance of each point to the best-fitting plane of its local neighboring points, which is based on the observation of the local spatial properties of the adversarial point cloud. By iterative diffusion and reverse denoising, the perturbed point cloud under various distortions can be restored back to a clean distribution. This approach enables effective defense against adaptive attacks with varying noise budgets, enhancing the robustness of existing 3D deep recognition models.

LGNov 20, 2022
Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling

Cong Zhang, Zhiguang Cao, Wen Song et al.

Recent studies in using deep reinforcement learning (DRL) to solve Job-shop scheduling problems (JSSP) focus on construction heuristics. However, their performance is still far from optimality, mainly because the underlying graph representation scheme is unsuitable for modelling partial solutions at each construction step. This paper proposes a novel DRL-guided improvement heuristic for solving JSSP, where graph representation is employed to encode complete solutions. We design a Graph Neural-Network-based representation scheme, consisting of two modules to effectively capture the information of dynamic topology and different types of nodes in graphs encountered during the improvement process. To speed up solution evaluation during improvement, we present a novel message-passing mechanism that can evaluate multiple solutions simultaneously. We prove that the computational complexity of our method scales linearly with problem size. Experiments on classic benchmarks show that the improvement policy learned by our method outperforms state-of-the-art DRL-based methods by a large margin.

LGJul 11, 2023
Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators

Sikai Bai, Shuaicheng Li, Weiming Zhuang et al.

Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution is different not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11 on CIFAR-10 and CINIC-10 datasets.

CVJul 24, 2023
Rethinking Data Distillation: Do Not Overlook Calibration

Dongyao Zhu, Bowen Lei, Jie Zhang et al.

Neural networks trained on distilled data often produce over-confident output and require correction by calibration methods. Existing calibration methods such as temperature scaling and mixup work well for networks trained on original large-scale data. However, we find that these methods fail to calibrate networks trained on data distilled from large source datasets. In this paper, we show that distilled data lead to networks that are not calibratable due to (i) a more concentrated distribution of the maximum logits and (ii) the loss of information that is semantically meaningful but unrelated to classification tasks. To address this problem, we propose Masked Temperature Scaling (MTS) and Masked Distillation Training (MDT) which mitigate the limitations of distilled data and achieve better calibration results while maintaining the efficiency of dataset distillation.

MAMar 26, 2022
Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning

Jie Zhang, Jun Li, Yijin Zhang et al.

Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks. In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting. Aiming to maximize the long-term average achievable system rate, an optimization problem is formulated by jointly designing the transmit beamforming at the base station (BS) and discrete phase shift beamforming at the IRSs, with the constraints on transmit power, user data rate requirement and IRS energy buffer size. Considering time-varying channels and stochastic arrivals of energy harvested by the IRSs, we first formulate the problem as a Markov decision process (MDP) and then develop a novel multi-agent Q-mix (MAQ) framework with two layers to decouple the optimization parameters. The higher layer is for optimizing phase shift resolutions, and the lower one is for phase shift beamforming and power allocation. Since the phase shift optimization is an integer programming problem with a large-scale action space, we improve MAQ by incorporating the Wolpertinger method, namely, MAQ-WP algorithm to achieve a sub-optimality with reduced dimensions of action space. In addition, as MAQ-WP is still of high complexity to achieve good performance, we propose a policy gradient-based MAQ algorithm, namely, MAQ-PG, by mapping the discrete phase shift actions into a continuous space at the cost of a slight performance loss. Simulation results demonstrate that the proposed MAQ-WP and MAQ-PG algorithms can converge faster and achieve data rate improvements of 10.7% and 8.8% over the conventional multi-agent DDPG, respectively.

CLMay 31
Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

Yubo Gao, Haotian Wu, Hong Chen et al.

Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to "overthinking": generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation that reasoning complexity is heterogeneous at two distinct granularity: across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.

CRAug 21, 2022
Fed-FSNet: Mitigating Non-I.I.D. Federated Learning via Fuzzy Synthesizing Network

Jingcai Guo, Song Guo, Jie Zhang et al.

Federated learning (FL) has emerged as a promising privacy-preserving distributed machine learning framework recently. It aims at collaboratively learning a shared global model by performing distributed training locally on edge devices and aggregating local models into a global one without centralized raw data sharing in the cloud server. However, due to the large local data heterogeneities (Non-I.I.D. data) across edge devices, the FL may easily obtain a global model that can produce more shifted gradients on local datasets, thereby degrading the model performance or even suffering from the non-convergence during training. In this paper, we propose a novel FL training framework, dubbed Fed-FSNet, using a properly designed Fuzzy Synthesizing Network (FSNet) to mitigate the Non-I.I.D. FL at-the-source. Concretely, we maintain an edge-agnostic hidden model in the cloud server to estimate a less-accurate while direction-aware inversion of the global model. The hidden model can then fuzzily synthesize several mimic I.I.D. data samples (sample features) conditioned on only the global model, which can be shared by edge devices to facilitate the FL training towards faster and better convergence. Moreover, since the synthesizing process involves neither access to the parameters/updates of local models nor analyzing individual local model outputs, our framework can still ensure the privacy of FL. Experimental results on several FL benchmarks demonstrate that our method can significantly mitigate the Non-I.I.D. issue and obtain better performance against other representative methods.

CVJul 27, 2023
The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

Lingdong Kong, Yaru Niu, Shaoyuan Xie et al.

Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs ranging from the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.

SEDec 20, 2022
Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities

Wei Ma, Shangqing Liu, Mengjie Zhao et al.

Past research has examined how well these models grasp code syntax, yet their understanding of code semantics still needs to be explored. We extensively analyze seven code models to investigate how code models represent code syntax and semantics. This includes four prominent code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) and three large language models (StarCoder, CodeLlama, and CodeT5+). We have developed four probing tasks to evaluate the models' abilities to learn code syntax and semantics. These tasks focus on reconstructing code syntax and semantic structures-such as AST, CFG, CDG, and DDG - within the models' representation spaces. These structures are fundamental to understanding code. Additionally, we explore the role of syntax tokens in each token representation and the extended dependencies among code tokens. Furthermore, we examine the distribution of attention weights concerning code semantic structures. Through detailed analysis, our results emphasize the strengths and weaknesses of various code models in mastering code syntax and semantics. The findings reveal that these models are proficient in grasping code syntax, effectively capturing the relationships and roles of syntax tokens. However, their ability to encode code semantics shows more variability. This study enriches our understanding of the capabilities of code models in analyzing syntax and semantics. Our findings offer valuable insights for future code model enhancements, helping optimize their application across a range of code-related tasks.

IRJul 19, 2023
Our Model Achieves Excellent Performance on MovieLens: What Does it Mean?

Yu-chen Fan, Yitong Ji, Jie Zhang et al.

A typical benchmark dataset for recommender system (RecSys) evaluation consists of user-item interactions generated on a platform within a time period. The interaction generation mechanism partially explains why a user interacts with (e.g., like, purchase, rate) an item, and the context of when a particular interaction happened. In this study, we conduct a meticulous analysis of the MovieLens dataset and explain the potential impact of using the dataset for evaluating recommendation algorithms. We make a few main findings from our analysis. First, there are significant differences in user interactions at the different stages when a user interacts with the MovieLens platform. The early interactions largely define the user portrait which affects the subsequent interactions. Second, user interactions are highly affected by the candidate movies that are recommended by the platform's internal recommendation algorithm(s). Third, changing the order of user interactions makes it more difficult for sequential algorithms to capture the progressive interaction process. We further discuss the discrepancy between the interaction generation mechanism that is employed by the MovieLens system and that of typical real-world recommendation scenarios. In summary, the MovieLens platform demonstrates an efficient and effective way of collecting user preferences to address cold-starts. However, models that achieve excellent recommendation accuracy on the MovieLens dataset may not demonstrate superior performance in practice, for at least two kinds of differences: (i) the differences in the contexts of user-item interaction generation, and (ii) the differences in user knowledge about the item collections. While results on MovieLens can be useful as a reference, they should not be solely relied upon as the primary justification for the effectiveness of a recommendation system model.

LGApr 14, 2022
Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression

Feijie Wu, Shiqi He, Song Guo et al.

Traditional one-bit compressed stochastic gradient descent can not be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. According to our theoretical findings, due to the cascading compression, the training process has considerable deterioration on the convergence performance. To overcome this limitation, we implement a sign-bit compression-based learning synchronization framework, Marsit. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and its specific global compensation mechanism for mitigating compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces up to 35% training time while preserving the same accuracy as training without compression.

CVSep 3, 2024Code
Shuffle Mamba: State Space Models with Random Shuffle for Multi-Modal Image Fusion

Ke Cao, Xuanhua He, Tao Hu et al.

Multi-modal image fusion integrates complementary information from different modalities to produce enhanced and informative images. Although State-Space Models, such as Mamba, are proficient in long-range modeling with linear complexity, most Mamba-based approaches use fixed scanning strategies, which can introduce biased prior information. To mitigate this issue, we propose a novel Bayesian-inspired scanning strategy called Random Shuffle, supplemented by a theoretically feasible inverse shuffle to maintain information coordination invariance, aiming to eliminate biases associated with fixed sequence scanning. Based on this transformation pair, we customized the Shuffle Mamba Framework, penetrating modality-aware information representation and cross-modality information interaction across spatial and channel axes to ensure robust interaction and an unbiased global receptive field for multi-modal image fusion. Furthermore, we develop a testing methodology based on Monte-Carlo averaging to ensure the model's output aligns more closely with expected results. Extensive experiments across multiple multi-modal image fusion tasks demonstrate the effectiveness of our proposed method, yielding excellent fusion quality compared to state-of-the-art alternatives. The code is available at https://github.com/caoke-963/Shuffle-Mamba.

CVJul 31, 2023
Sampling to Distill: Knowledge Transfer from Open-World Data

Yuzheng Wang, Zhaoyu Chen, Jie Zhang et al.

Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only the pre-trained teacher network without original training data. Most of the existing DFKD methods rely heavily on additional generation modules to synthesize the substitution data resulting in high computational costs and ignoring the massive amounts of easily accessible, low-cost, unlabeled open-world data. Meanwhile, existing methods ignore the domain shift issue between the substitution data and the original data, resulting in knowledge from teachers not always trustworthy and structured knowledge from data becoming a crucial supplement. To tackle the issue, we propose a novel Open-world Data Sampling Distillation (ODSD) method for the DFKD task without the redundant generation process. First, we try to sample open-world data close to the original data's distribution by an adaptive sampling module and introduce a low-noise representation to alleviate the domain shift issue. Then, we build structured relationships of multiple data examples to exploit data knowledge through the student model itself and the teacher's structured representation. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet show that our ODSD method achieves state-of-the-art performance with lower FLOPs and parameters. Especially, we improve 1.50\%-9.59\% accuracy on the ImageNet dataset and avoid training the separate generator for each class.

IVJan 20, 2023
DiffusionCT: Latent Diffusion Model for CT Image Standardization

Md Selim, Jie Zhang, Michael A. Brooks et al.

Computed tomography (CT) is one of the modalities for effective lung cancer screening, diagnosis, treatment, and prognosis. The features extracted from CT images are now used to quantify spatial and temporal variations in tumors. However, CT images obtained from various scanners with customized acquisition protocols may introduce considerable variations in texture features, even for the same patient. This presents a fundamental challenge to downstream studies that require consistent and reliable feature analysis. Existing CT image harmonization models rely on GAN-based supervised or semi-supervised learning, with limited performance. This work addresses the issue of CT image harmonization using a new diffusion-based model, named DiffusionCT, to standardize CT images acquired from different vendors and protocols. DiffusionCT operates in the latent space by mapping a latent non-standard distribution into a standard one. DiffusionCT incorporates an Unet-based encoder-decoder, augmented by a diffusion model integrated into the bottleneck part. The model is designed in two training phases. The encoder-decoder is first trained, without embedding the diffusion model, to learn the latent representation of the input data. The latent diffusion model is then trained in the next training phase while fixing the encoder-decoder. Finally, the decoder synthesizes a standardized image with the transformed latent representation. The experimental results demonstrate a significant improvement in the performance of the standardization task using DiffusionCT.

DBNov 13, 2022
Reinforcement Learning Enhanced Weighted Sampling for Accurate Subgraph Counting on Fully Dynamic Graph Streams

Kaixin Wang, Cheng Long, Da Yan et al.

As the popularity of graph data increases, there is a growing need to count the occurrences of subgraph patterns of interest, for a variety of applications. Many graphs are massive in scale and also fully dynamic (with insertions and deletions of edges), rendering exact computation of these counts to be infeasible. Common practice is, instead, to use a small set of edges as a sample to estimate the counts. Existing sampling algorithms for fully dynamic graphs sample the edges with uniform probability. In this paper, we show that we can do much better if we sample edges based on their individual properties. Specifically, we propose a weighted sampling algorithm called WSD for estimating the subgraph count in a fully dynamic graph stream, which samples the edges based on their weights that indicate their importance and reflect their properties. We determine the weights of edges in a data-driven fashion, using a novel method based on reinforcement learning. We conduct extensive experiments to verify that our technique can produce estimates with smaller errors while often running faster compared with existing algorithms.

CVAug 21, 2023
Patch Is Not All You Need

Changzhen Li, Jie Zhang, Yang Wei et al.

Vision Transformers have achieved great success in computer visions, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch sequences, which disrupts the image's inherent structural and semantic continuity. To handle this, we propose a novel Pattern Transformer (Patternformer) to adaptively convert images to pattern sequences for Transformer input. Specifically, we employ the Convolutional Neural Network to extract various patterns from the input image, with each channel representing a unique pattern that is fed into the succeeding Transformer as a visual token. By enabling the network to optimize these patterns, each pattern concentrates on its local region of interest, thereby preserving its intrinsic structural and semantic information. Only employing the vanilla ResNet and Transformer, we have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.

CVJul 6, 2023
That's BAD: Blind Anomaly Detection by Implicit Local Feature Clustering

Jie Zhang, Masanori Suganuma, Takayuki Okatani

Recent studies on visual anomaly detection (AD) of industrial objects/textures have achieved quite good performance. They consider an unsupervised setting, specifically the one-class setting, in which we assume the availability of a set of normal (\textit{i.e.}, anomaly-free) images for training. In this paper, we consider a more challenging scenario of unsupervised AD, in which we detect anomalies in a given set of images that might contain both normal and anomalous samples. The setting does not assume the availability of known normal data and thus is completely free from human annotation, which differs from the standard AD considered in recent studies. For clarity, we call the setting blind anomaly detection (BAD). We show that BAD can be converted into a local outlier detection problem and propose a novel method named PatchCluster that can accurately detect image- and pixel-level anomalies. Experimental results show that PatchCluster shows a promising performance without the knowledge of normal data, even comparable to the SOTA methods applied in the one-class setting needing it.

CVNov 13, 2025Code
When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Qilang Ye, Wei Zeng, Meng Liu et al.

Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an ``Audio-Visual Confusion'' scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs Is there a/an muted-object sound''. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10~30\% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.

MED-PHJun 10, 2011
Omni-tomography/Multi-tomography -- Integrating Multiple Modalities for Simultaneous Imaging

Ge Wang, Jie Zhang, Hao Gao et al.

Current tomographic imaging systems need major improvements, especially when multi-dimensional, multi-scale, multi-temporal and multi-parametric phenomena are under investigation. Both preclinical and clinical imaging now depend on in vivo tomography, often requiring separate evaluations by different imaging modalities to define morphologic details, delineate interval changes due to disease or interventions, and study physiological functions that have interconnected aspects. Over the past decade, fusion of multimodality images has emerged with two different approaches: post-hoc image registration and combined acquisition on PET-CT, PET-MRI and other hybrid scanners. There are intrinsic limitations for both the post-hoc image analysis and dual/triple modality approaches defined by registration errors and physical constraints in the acquisition chain. We envision that tomography will evolve beyond current modality fusion and towards grand fusion, a large scale fusion of all or many imaging modalities, which may be referred to as omni-tomography or multi-tomography. Unlike modality fusion, grand fusion is here proposed for truly simultaneous but often localized reconstruction in terms of all or many relevant imaging mechanisms such as CT, MRI, PET, SPECT, US, optical, and possibly more. In this paper, the technical basis for omni-tomography is introduced and illustrated with a top-level design of a next generation scanner, interior tomographic reconstructions of representative modalities, and anticipated applications of omni-tomography.

CVMar 12Code
INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

Junqi Yang, Yuecong Min, Jie Zhang et al.

Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce \textsc{INFACT}, a diagnostic benchmark comprising 9{,}800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. \textsc{INFACT} evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.

LGJun 2, 2023
On Knowledge Editing in Federated Learning: Perspectives, Challenges, and Future Directions

Leijie Wu, Song Guo, Junxiao Wang et al.

As Federated Learning (FL) has gained increasing attention, it has become widely acknowledged that straightforwardly applying stochastic gradient descent (SGD) on the overall framework when learning over a sequence of tasks results in the phenomenon known as ``catastrophic forgetting''. Consequently, much FL research has centered on devising federated increasing learning methods to alleviate forgetting while augmenting knowledge. On the other hand, forgetting is not always detrimental. The selective amnesia, also known as federated unlearning, which entails the elimination of specific knowledge, can address privacy concerns and create additional ``space'' for acquiring new knowledge. However, there is a scarcity of extensive surveys that encompass recent advancements and provide a thorough examination of this issue. In this manuscript, we present an extensive survey on the topic of knowledge editing (augmentation/removal) in Federated Learning, with the goal of summarizing the state-of-the-art research and expanding the perspective for various domains. Initially, we introduce an integrated paradigm, referred to as Federated Editable Learning (FEL), by reevaluating the entire lifecycle of FL. Secondly, we provide a comprehensive overview of existing methods, evaluate their position within the proposed paradigm, and emphasize the current challenges they face. Lastly, we explore potential avenues for future research and identify unresolved issues.

LGNov 21, 2022
Exploiting Personalized Invariance for Better Out-of-distribution Generalization in Federated Learning

Xueyang Tang, Song Guo, Jie Zhang

Recently, data heterogeneity among the training datasets on the local clients (a.k.a., Non-IID data) has attracted intense interest in Federated Learning (FL), and many personalized federated learning methods have been proposed to handle it. However, the distribution shift between the training dataset and testing dataset on each client is never considered in FL, despite it being general in real-world scenarios. We notice that the distribution shift (a.k.a., out-of-distribution generalization) problem under Non-IID federated setting becomes rather challenging due to the entanglement between personalized and spurious information. To tackle the above problem, we elaborate a general dual-regularized learning framework to explore the personalized invariance, compared with the exsiting personalized federated learning methods which are regularized by a single baseline (usually the global model). Utilizing the personalized invariant features, the developed personalized models can efficiently exploit the most relevant information and meanwhile eliminate spurious information so as to enhance the out-of-distribution generalization performance for each client. Both the theoretical analysis on convergence and OOD generalization performance and the results of extensive experiments demonstrate the superiority of our method over the existing federated learning and invariant learning methods, in diverse out-of-distribution and Non-IID data cases.

LGNov 14, 2022
Feature Correlation-guided Knowledge Transfer for Federated Self-supervised Learning

Yi Liu, Song Guo, Jie Zhang et al.

To eliminate the requirement of fully-labeled data for supervised model training in traditional Federated Learning (FL), extensive attention has been paid to the application of Self-supervised Learning (SSL) approaches on FL to tackle the label scarcity problem. Previous works on Federated SSL generally fall into two categories: parameter-based model aggregation (i.e., FedAvg, applicable to homogeneous cases) or data-based feature sharing (i.e., knowledge distillation, applicable to heterogeneous cases) to achieve knowledge transfer among multiple unlabeled clients. Despite the progress, all of them inevitably rely on some assumptions, such as homogeneous models or the existence of an additional public dataset, which hinder the universality of the training frameworks for more general scenarios. Therefore, in this paper, we propose a novel and general method named Federated Self-supervised Learning with Feature-correlation based Aggregation (FedFoA) to tackle the above limitations in a communication-efficient and privacy-preserving manner. Our insight is to utilize feature correlation to align the feature mappings and calibrate the local model updates across clients during their local training process. More specifically, we design a factorization-based method to extract the cross-feature relation matrix from the local representations. Then, the relation matrix can be regarded as a carrier of semantic information to perform the aggregation phase. We prove that FedFoA is a model-agnostic training framework and can be easily compatible with state-of-the-art unsupervised FL methods. Extensive empirical experiments demonstrate that our proposed approach outperforms the state-of-the-art methods by a significant margin.