LGMar 6, 2023Code
Rethinking Confidence Calibration for Failure PredictionFei Zhu, Zhen Cheng, Xu-Yao Zhang et al.
Reliable confidence estimation for the predictions is important in many safety-critical applications. However, modern deep neural networks are often overconfident for their incorrect predictions. Recently, many calibration methods have been proposed to alleviate the overconfidence problem. With calibrated confidence, a primary and practical purpose is to detect misclassification errors by filtering out low-confidence predictions (known as failure prediction). In this paper, we find a general, widely-existed but actually-neglected phenomenon that most confidence calibration methods are useless or harmful for failure prediction. We investigate this problem and reveal that popular confidence calibration methods often lead to worse confidence separation between correct and incorrect samples, making it more difficult to decide whether to trust a prediction or not. Finally, inspired by the natural connection between flat minima and confidence separation, we propose a simple hypothesis: flat minima is beneficial for failure prediction. We verify this hypothesis via extensive experiments and further boost the performance by combining two different flat minima techniques. Our code is available at https://github.com/Impression2805/FMFP
LGMar 30, 2023Code
OpenMix: Exploring Outlier Samples for Misclassification DetectionFei Zhu, Zhen Cheng, Xu-Yao Zhang et al.
Reliable confidence estimation for deep neural classifiers is a challenging yet fundamental requirement in high-stakes applications. Unfortunately, modern deep neural networks are often overconfident for their erroneous predictions. In this work, we exploit the easily available outlier samples, i.e., unlabeled samples coming from non-target classes, for helping detect misclassification errors. Particularly, we find that the well-known Outlier Exposure, which is powerful in detecting out-of-distribution (OOD) samples from unknown classes, does not provide any gain in identifying misclassification errors. Based on these observations, we propose a novel method called OpenMix, which incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation. OpenMix significantly improves confidence reliability under various scenarios, establishing a strong and unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes. The code is publicly available at https://github.com/Impression2805/OpenMix.
LGMar 21, 2023Code
Dynamics-Aware Loss for Learning with Label NoiseXiu-Chuan Li, Xiaobo Xia, Fei Zhu et al.
Label noise poses a serious threat to deep neural networks (DNNs). Employing robust loss functions which reconcile fitting ability with robustness is a simple but effective strategy to handle this problem. However, the widely-used static trade-off between these two factors contradicts the dynamics of DNNs learning with label noise, leading to inferior performance. Therefore, we propose a dynamics-aware loss (DAL) to solve this problem. Considering that DNNs tend to first learn beneficial patterns, then gradually overfit harmful label noise, DAL strengthens the fitting ability initially, then gradually improves robustness. Moreover, at the later stage, to further reduce the negative impact of label noise and combat underfitting simultaneously, we let DNNs put more emphasis on easy examples than hard ones and introduce a bootstrapping term. Both the detailed theoretical analyses and extensive experimental results demonstrate the superiority of our method. Our source code can be found in https://github.com/XiuchuanLi/DAL.
LGJul 18, 2023Code
Towards Trustworthy Dataset DistillationShijie Ma, Fei Zhu, Zhen Cheng et al.
Efficiency and trustworthiness are two eternal pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset. However, existing methods merely concentrate on in-distribution (InD) classification in a closed-world setting, disregarding out-of-distribution (OOD) samples. On the other hand, OOD detection aims to enhance models' trustworthiness, which is always inefficiently achieved in full-data settings. For the first time, we simultaneously consider both issues and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets are capable of training models competent in both InD classification and OOD detection. To alleviate the requirement of real outlier data, we further propose to corrupt InD samples to generate pseudo-outliers, namely Pseudo-Outlier Exposure (POE). Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and POE surpasses the state-of-the-art method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more trustworthy and applicable to open-world scenarios. Our code is available at https://github.com/mashijie1028/TrustDD
CVSep 7, 2024Code
Local-Prompt: Extensible Local Prompts for Few-Shot Out-of-Distribution DetectionFanhu Zeng, Zhen Cheng, Fei Zhu et al.
Out-of-Distribution (OOD) detection, aiming to distinguish outliers from known categories, has gained prominence in practical scenarios. Recently, the advent of vision-language models (VLM) has heightened interest in enhancing OOD detection for VLM through few-shot tuning. However, existing methods mainly focus on optimizing global prompts, ignoring refined utilization of local information with regard to outliers. Motivated by this, we freeze global prompts and introduce Local-Prompt, a novel coarse-to-fine tuning paradigm to emphasize regional enhancement with local prompts. Our method comprises two integral components: global prompt guided negative augmentation and local prompt enhanced regional regularization. The former utilizes frozen, coarse global prompts as guiding cues to incorporate negative augmentation, thereby leveraging local outlier knowledge. The latter employs trainable local prompts and a regional regularization to capture local information effectively, aiding in outlier identification. We also propose regional-related metric to empower the enrichment of OOD detection. Moreover, since our approach explores enhancing local prompts only, it can be seamlessly integrated with trained global prompts during inference to boost the performance. Comprehensive experiments demonstrate the effectiveness and potential of our method. Notably, our method reduces average FPR95 by 5.17% against state-of-the-art method in 4-shot tuning on challenging ImageNet-1k dataset, even outperforming 16-shot results of previous methods. Code is released at https://github.com/AuroraZengfh/Local-Prompt.
CVAug 4, 2023
Class Incremental Learning with Self-Supervised Pre-Training and Prototype LearningWenzhuo Liu, Xinjian Wu, Fei Zhu et al.
Deep Neural Network (DNN) has achieved great success on datasets of closed class set. However, new classes, like new categories of social media topics, are continuously added to the real world, making it necessary to incrementally learn. This is hard for DNN because it tends to focus on fitting to new classes while ignoring old classes, a phenomenon known as catastrophic forgetting. State-of-the-art methods rely on knowledge distillation and data replay techniques but still have limitations. In this work, we analyze the causes of catastrophic forgetting in class incremental learning, which owes to three factors: representation drift, representation confusion, and classifier distortion. Based on this view, we propose a two-stage learning framework with a fixed encoder and an incrementally updated prototype classifier. The encoder is trained with self-supervised learning to generate a feature space with high intrinsic dimensionality, thus improving its transferability and generality. The classifier incrementally learns new prototypes while retaining the prototypes of previously learned data, which is crucial in preserving the decision boundary.Our method does not rely on preserved samples of old classes, is thus a non-exemplar based CIL method. Experiments on public datasets show that our method can significantly outperform state-of-the-art exemplar-based methods when they reserved 5 examplers per class, under the incremental setting of 10 phases, by 18.24% on CIFAR-100 and 9.37% on ImageNet100.
LGMar 2, 2023
Average of Pruning: Improving Performance and Stability of Out-of-Distribution DetectionZhen Cheng, Fei Zhu, Xu-Yao Zhang et al.
Detecting Out-of-distribution (OOD) inputs have been a critical issue for neural networks in the open world. However, the unstable behavior of OOD detection along the optimization trajectory during training has not been explored clearly. In this paper, we first find the performance of OOD detection suffers from overfitting and instability during training: 1) the performance could decrease when the training error is near zero, and 2) the performance would vary sharply in the final stage of training. Based on our findings, we propose Average of Pruning (AoP), consisting of model averaging and pruning, to mitigate the unstable behaviors. Specifically, model averaging can help achieve a stable performance by smoothing the landscape, and pruning is certified to eliminate the overfitting by eliminating redundant features. Comprehensive experiments on various datasets and architectures are conducted to verify the effectiveness of our method.
CVJul 19, 2024
PASS++: A Dual Bias Reduction Framework for Non-Exemplar Class-Incremental LearningFei Zhu, Xu-Yao Zhang, Zhen Cheng et al.
Class-incremental learning (CIL) aims to recognize new classes incrementally while maintaining the discriminability of old classes. Most existing CIL methods are exemplar-based, i.e., storing a part of old data for retraining. Without relearning old data, those methods suffer from catastrophic forgetting. In this paper, we figure out two inherent problems in CIL, i.e., representation bias and classifier bias, that cause catastrophic forgetting of old knowledge. To address these two biases, we present a simple and novel dual bias reduction framework that employs self-supervised transformation (SST) in input space and prototype augmentation (protoAug) in deep feature space. On the one hand, SST alleviates the representation bias by learning generic and diverse representations that can transfer across different tasks. On the other hand, protoAug overcomes the classifier bias by explicitly or implicitly augmenting prototypes of old classes in the deep feature space, which poses tighter constraints to maintain previously learned decision boundaries. We further propose hardness-aware prototype augmentation and multi-view ensemble strategies, leading to significant improvements. The proposed framework can be easily integrated with pre-trained models. Without storing any samples of old classes, our method can perform comparably with state-of-the-art exemplar-based approaches which store plenty of old data. We hope to draw the attention of researchers back to non-exemplar CIL by rethinking the necessity of storing old samples in CIL.
87.1GRMay 13Code
BlitzGS: City-Scale Gaussian Splatting at Lightning SpeedZhongtao Wang, Huishan Au, Yilong Li et al.
We present BlitzGS, a distributed 3DGS framework that reduces active Gaussian workload for fast city-scale reconstruction. BlitzGS manages this workload at three coupled levels. At the system level, the framework shards Gaussians across GPUs by index parity rather than spatial blocks. This approach mitigates the cross-block visibility redundancy inherent in spatial partitioning. Furthermore, it distributes each rendering step through a single cross-GPU exchange that routes projected Gaussians to their tile owners. At the model level, scheduled importance-scoring passes shrink the global Gaussian population. During these passes, the framework generates a per-Gaussian visibility weight to bias density-control updates toward contributing primitives and a per-view importance mask for the view-level renderer. At the view level, BlitzGS trims each camera's active set with a distance-based LOD gate to exclude excessively fine primitives for the current frustum and the importance-based culling mask to skip Gaussians with negligible cross-view contribution. On large-scale benchmarks, BlitzGS matches the rendering quality of recent large-scale baselines while delivering an order-of-magnitude speedup, training city-scale scenes in tens of minutes. Our code is available at https: //github.com/AkierRaee/BlitzGS.
20.9SDApr 19
SonicRadiation: A Hybrid Numerical Solution for Sound Radiation without Ghost CellsXutong Jin, Fei Zhu, Guoping Wang et al.
Interactive synthesis of physical sound effects is crucial in digital media production. Sound radiation simulation, a key component of physically based sound synthesis, has posed challenges in the context of complex object boundaries. Previous methods, such as ghost cell-based finite-difference time-domain (FDTD) wave solver, have struggled to address these challenges, leading to large errors and failures in complex boundaries because of the limitation of ghost cells. We present SonicRadiation, a hybrid numerical solution capable of handling complex and dynamic object boundaries in sound radiation simulation without relying on ghost cells. We derive a consistent formulation to connect the physical quantities on grid cells in FDTD with the boundary elements in the time-domain boundary element method (TDBEM). Hereby, we propose a boundary grid synchronization strategy to seamlessly integrate TDBEM with FDTD while maintaining high numerical accuracy. Our method holds both advantages from the accuracy of TDBEM for the near-field and the efficiency of FDTD for the far-field. Experimental results demonstrate the superiority of our method in sound radiation simulation over previous approaches in terms of accuracy and efficiency, particularly in complex scenes, further validating its effectiveness.
CVMar 7, 2024Code
Active Generalized Category DiscoveryShijie Ma, Fei Zhu, Zhun Zhong et al.
Generalized Category Discovery (GCD) is a pragmatic and challenging open-world task, which endeavors to cluster unlabeled samples from both novel and old classes, leveraging some labeled data of old classes. Given that knowledge learned from old classes is not fully transferable to new classes, and that novel categories are fully unlabeled, GCD inherently faces intractable problems, including imbalanced classification performance and inconsistent confidence between old and new classes, especially in the low-labeling regime. Hence, some annotations of new classes are deemed necessary. However, labeling new classes is extremely costly. To address this issue, we take the spirit of active learning and propose a new setting called Active Generalized Category Discovery (AGCD). The goal is to improve the performance of GCD by actively selecting a limited amount of valuable samples for labeling from the oracle. To solve this problem, we devise an adaptive sampling strategy, which jointly considers novelty, informativeness and diversity to adaptively select novel samples with proper uncertainty. However, owing to the varied orderings of label indices caused by the clustering of novel classes, the queried labels are not directly applicable to subsequent training. To overcome this issue, we further propose a stable label mapping algorithm that transforms ground truth labels to the label space of the classifier, thereby ensuring consistent training across different active selection stages. Our method achieves state-of-the-art performance on both generic and fine-grained datasets. Our code is available at https://github.com/mashijie1028/ActiveGCD
CVJan 4, 2024Code
PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental LearningHaiyang Guo, Fei Zhu, Wenzhuo Liu et al.
Existing federated learning methods have effectively dealt with decentralized learning in scenarios involving data privacy and non-IID data. However, in real-world situations, each client dynamically learns new classes, requiring the global model to classify all seen classes. To effectively mitigate catastrophic forgetting and data heterogeneity under low communication costs, we propose a simple and effective method named PILoRA. On the one hand, we adopt prototype learning to learn better feature representations and leverage the heuristic information between prototypes and class features to design a prototype re-weight module to solve the classifier bias caused by data heterogeneity without retraining the classifier. On the other hand, we view incremental learning as the process of learning distinct task vectors and encoding them within different LoRA parameters. Accordingly, we propose Incremental LoRA to mitigate catastrophic forgetting. Experimental results on standard datasets indicate that our method outperforms the state-of-the-art approaches significantly. More importantly, our method exhibits strong robustness and superiority in different settings and degrees of data heterogeneity. The code is available at \url{https://github.com/Ghy0501/PILoRA}.
CVMar 18, 2024Code
Continual Forgetting for Pre-trained Vision ModelsHongbo Zhao, Bolin Ni, Haochen Wang et al.
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Codes will be released on \url{https://github.com/bjzhb666/GS-LoRA}.
CVMar 5, 2024Code
Revisiting Confidence Estimation: Towards Reliable Failure PredictionFei Zhu, Xu-Yao Zhang, Zhen Cheng et al.
Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction. The code is available at \url{https://github.com/Impression2805/FMFP}.
CVMar 14, 2023
Nonlinear Hyperspectral Unmixing based on Multilinear Mixing Model using Convolutional AutoencodersTingting Fang, Fei Zhu, Jie Chen
Unsupervised spectral unmixing consists of representing each observed pixel as a combination of several pure materials called endmembers with their corresponding abundance fractions. Beyond the linear assumption, various nonlinear unmixing models have been proposed, with the associated optimization problems solved either by traditional optimization algorithms or deep learning techniques. Current deep learning-based nonlinear unmixing focuses on the models in additive, bilinear-based formulations. By interpreting the reflection process using the discrete Markov chain, the multilinear mixing model (MLM) successfully accounts for the up to infinite-order interactions between endmembers. However, to simulate the physics process of MLM by neural networks explicitly is a challenging problem that has not been approached by far. In this article, we propose a novel autoencoder-based network for unsupervised unmixing based on MLM. Benefitting from an elaborate network design, the relationships among all the model parameters {\em i.e.}, endmembers, abundances, and transition probability parameters are explicitly modeled. There are two modes: MLM-1DAE considers only pixel-wise spectral information, and MLM-3DAE exploits the spectral-spatial correlations within input patches. Experiments on both the synthetic and real datasets demonstrate the effectiveness of the proposed method as it achieves competitive performance to the classic solutions of MLM.
LGApr 2, 2025Code
ProtoGCD: Unified and Unbiased Prototype Learning for Generalized Category DiscoveryShijie Ma, Fei Zhu, Xu-Yao Zhang et al.
Generalized category discovery (GCD) is a pragmatic but underexplored problem, which requires models to automatically cluster and discover novel categories by leveraging the labeled samples from old classes. The challenge is that unlabeled data contain both old and new classes. Early works leveraging pseudo-labeling with parametric classifiers handle old and new classes separately, which brings about imbalanced accuracy between them. Recent methods employing contrastive learning neglect potential positives and are decoupled from the clustering objective, leading to biased representations and sub-optimal results. To address these issues, we introduce a unified and unbiased prototype learning framework, namely ProtoGCD, wherein old and new classes are modeled with joint prototypes and unified learning objectives, {enabling unified modeling between old and new classes}. Specifically, we propose a dual-level adaptive pseudo-labeling mechanism to mitigate confirmation bias, together with two regularization terms to collectively help learn more suitable representations for GCD. Moreover, for practical considerations, we devise a criterion to estimate the number of new classes. Furthermore, we extend ProtoGCD to detect unseen outliers, achieving task-level unification. Comprehensive experiments show that ProtoGCD achieves state-of-the-art performance on both generic and fine-grained datasets. The code is available at https://github.com/mashijie1028/ProtoGCD.
CLMar 17, 2025Code
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language ModelHaiyang Guo, Fanhu Zeng, Ziwei Xiang et al.
Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.
CLJun 5, 2025Code
MLLM-CL: Continual Learning for Multimodal Large Language ModelsHongbo Zhao, Fei Zhu, Haiyang Guo et al.
Recent Multimodal Large Language Models (MLLMs) excel in vision-language understanding but face challenges in adapting to dynamic real-world scenarios that require continuous integration of new knowledge and skills. While continual learning (CL) offers a potential solution, existing benchmarks and methods suffer from critical limitations. In this paper, we introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning, where the former focuses on independently and identically distributed (IID) evaluation across evolving mainstream domains, whereas the latter evaluates on non-IID scenarios with new model abilities. Methodologically, we propose preventing catastrophic interference through parameter isolation and an MLLM-based routing mechanism. Extensive experiments demonstrate that our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods. Our benchmark and code are available at https://github.com/bjzhb666/MLLM-CL.
LGMar 17, 2025Code
Federated Continual Instruction TuningHaiyang Guo, Fanhu Zeng, Fei Zhu et al.
A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Code and dataset are released at https://github.com/Ghy0501/FCIT.
CVFeb 24, 2025Code
RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction RobustnessFanhu Zeng, Haiyang Guo, Fei Zhu et al.
Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.
LGNov 28, 2024Code
DESIRE: Dynamic Knowledge Consolidation for Rehearsal-Free Continual LearningHaiyang Guo, Fei Zhu, Fanhu Zeng et al.
Continual learning aims to equip models with the ability to retain previously learned knowledge like a human. Recent work incorporating Parameter-Efficient Fine-Tuning has revitalized the field by introducing lightweight extension modules. However, existing methods usually overlook the issue of information leakage caused by the fact that the experiment data have been used in pre-trained models. Once these duplicate data are removed in the pre-training phase, their performance can be severely affected. In this paper, we propose a new LoRA-based rehearsal-free method named DESIRE. Our method avoids imposing additional constraints during training to mitigate catastrophic forgetting, thereby maximizing the learning of new classes. To integrate knowledge from old and new tasks, we propose two efficient post-processing modules. On the one hand, we retain only two sets of LoRA parameters for merging and propose dynamic representation consolidation to calibrate the merged feature representation. On the other hand, we propose decision boundary refinement to address classifier bias when training solely on new class data. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple datasets and strikes an effective balance between stability and plasticity. Our code will be publicly available.
92.5LGMay 17
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR EraQiuhe Hong, Yuyang Liu, Shuo Yang et al.
Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
LGJun 16, 2025Code
Continual Learning for Generative AI: From LLMs to MLLMs and BeyondHaiyang Guo, Fanhu Zeng, Fei Zhu et al.
The rapid advancement of generative models has empowered modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models are fundamentally constrained by \emph{catastrophic forgetting}, \ie~a persistent challenge where models experience performance degradation on previously learned tasks when adapting to new tasks. To address this practical limitation, numerous approaches have been proposed to enhance the adaptability and scalability of generative AI in real-world applications. In this work, we present a comprehensive survey of continual learning methods for mainstream generative AI models, encompassing large language models, multimodal large language models, vision-language-action models, and diffusion models. Drawing inspiration from the memory mechanisms of the human brain, we systematically categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based methods, while elucidating their underlying methodologies and motivations. We further analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones, thereby providing deeper insights into the field. The project page of this paper is available at https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models.
CVAug 10, 2025Code
MCITlib: Multimodal Continual Instruction Tuning Library and BenchmarkHaiyang Guo, Fei Zhu, Hongbo Zhao et al.
Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib.
75.8CVApr 1
CL-VISTA: Benchmarking Continual Learning in Video Large Language ModelsHaiyang Guo, Yichen Shi, Fei Zhu et al.
Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to assess whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.
CVJan 16, 2025Code
Practical Continual Forgetting for Pre-trained Vision ModelsHongbo Zhao, Fei Zhu, Bolin Ni et al.
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Codes have been released on https://github.com/bjzhb666/GS-LoRA.
17.6CVApr 13
A Deep Equilibrium Network for Hyperspectral UnmixingChentong Wang, Jincheng Gao, Fei Zhu et al.
Hyperspectral unmixing (HU) is crucial for analyzing hyperspectral imagery, yet achieving accurate unmixing remains challenging. While traditional methods struggle to effectively model complex spectral-spatial features, deep learning approaches often lack physical interpretability. Unrolling-based methods, despite offering network interpretability, inadequately exploit spectral-spatial information and incur high memory costs and numerical precision issues during backpropagation. To address these limitations, we propose DEQ-Unmix, which reformulates abundance estimation as a deep equilibrium model, enabling efficient constant-memory training via implicit differentiation. It replaces the gradient operator of the data reconstruction term with a trainable convolutional network to capture spectral-spatial information. By leveraging implicit differentiation, DEQ-Unmix enables efficient and constant-memory backpropagation. Experiments on synthetic and two real-world datasets demonstrate that DEQ-Unmix achieves superior unmixing performance while maintaining constant memory cost.
CVDec 17, 2025Code
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?Hongbo Zhao, Meng Wang, Fei Zhu et al.
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios.We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-processed information, failing to capture long associations or dependencies in the context.This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
GRNov 24, 2025Code
ChronoGS: Disentangling Invariants and Changes in Multi-Period ScenesZhongtao Wang, Jiaqi Dai, Qingtian Zhu et al.
Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It's also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset are publicly available at https://github.com/ZhongtaoWang/ChronoGS.
LGMar 4, 2024
Open-world machine learning: A review and new outlooksFei Zhu, Shijie Ma, Zhen Cheng et al.
Machine learning has achieved remarkable success in many applications. However, existing studies are largely based on the closed-world assumption, which assumes that the environment is stationary, and the model is fixed once deployed. In many real-world applications, this fundamental and rather naive assumption may not hold because an open environment is complex, dynamic, and full of unknowns. In such cases, rejecting unknowns, discovering novelties, and then continually learning them, could enable models to be safe and evolve continually as biological systems do. This article presents a holistic view of open-world machine learning by investigating unknown rejection, novelty discovery, and continual learning in a unified paradigm. The challenges, principles, and limitations of current methodologies are discussed in detail. Furthermore, widely used benchmarks, metrics, and performances are summarized. Finally, we discuss several potential directions for further progress in the field. By providing a comprehensive introduction to the emerging open-world machine learning paradigm, this article aims to help researchers build more powerful AI systems in their respective fields, and to promote the development of artificial general intelligence.
LGJul 7, 2025
Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-TrainingSong Lai, Haohan Zhao, Rong Feng et al.
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieve performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT's gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
31.4CVApr 27
Unconstrained Multi-view Human Pose Estimation with Algebraic PriorsXiaolin Qin, Qianlei Wang, Jiacen Liu et al.
Recovering 3D human pose from multi-view imagery typically relies on precise camera calibration, which is often unavailable in real-world scenarios, thereby severely limiting the applicability of existing methods. To overcome this challenge, we propose an unconstrained framework that synergizes deep neural networks, algebraic priors, and temporal dynamics for uncalibrated multi-view human pose estimation. First, we introduce the Triangulation with Transformer Regressor (TTR), which reformulates classical triangulation into a data-driven token fusion process to bypass the dependency on explicit camera parameters. Second, to explicitly embed the inherent algebraic relations of the multi-view variety into the learning process, we propose the Gröbner basis Corrector (GC). This pioneering loss formulation enforces constraints derived from the multi-view variety to ensure the neural predictions strictly adhere to the laws of projective geometry. Finally, we devise the Temporal Equivariant Rectifier (TER), which exploits the equivariance property of human motion to impose temporal coherence and structural consistency, effectively mitigating scale ambiguity in uncalibrated settings. Extensive evaluations on standard benchmarks demonstrate that our framework establishes a new state-of-the-art for uncalibrated multi-view human pose estimation. Notably, our approach significantly closes the performance gap between calibration-free methods and fully calibrated oracles.
72.1ROApr 1
C-NAV: Towards Self-Evolving Continual Object Navigation in Open WorldMing-Ming Yu, Fei Zhu, Wenzhuo Liu et al.
Embodied agents are expected to perform object navigation in dynamic, open-world environments. However, existing approaches typically rely on static trajectories and a fixed set of object categories during training, overlooking the real-world requirement for continual adaptation to evolving scenarios. To facilitate related studies, we introduce the continual object navigation benchmark, which requires agents to acquire navigation skills for new object categories while avoiding catastrophic forgetting of previously learned knowledge. To tackle this challenge, we propose C-Nav, a continual visual navigation framework that integrates two key innovations: (1) A dual-path anti-forgetting mechanism, which comprises feature distillation that aligns multi-modal inputs into a consistent representation space to ensure representation consistency, and feature replay that retains temporal features within the action decoder to ensure policy consistency. (2) An adaptive sampling strategy that selects diverse and informative experiences, thereby reducing redundancy and minimizing memory overhead. Extensive experiments across multiple model architectures demonstrate that C-Nav consistently outperforms existing approaches, achieving superior performance even compared to baselines with full trajectory retention, while significantly lowering memory requirements. The code will be publicly available at https://bigtree765.github.io/C-Nav-project.
CVJun 10, 2025
LLaVA-c: Continual Improved Visual Instruction TuningWenzhuo Liu, Fei Zhu, Haiyang Guo et al.
Multimodal models like LLaVA-1.5 achieve state-of-the-art visual understanding through visual instruction tuning on multitask datasets, enabling strong instruction-following and multimodal performance. However, multitask learning faces challenges such as task balancing, requiring careful adjustment of data proportions, and expansion costs, where new tasks risk catastrophic forgetting and need costly retraining. Continual learning provides a promising alternative to acquiring new knowledge incrementally while preserving existing capabilities. However, current methods prioritize task-specific performance, neglecting base model degradation from overfitting to specific instructions, which undermines general capabilities. In this work, we propose a simple but effective method with two modifications on LLaVA-1.5: spectral-aware consolidation for improved task balance and unsupervised inquiry regularization to prevent base model degradation. We evaluate both general and task-specific performance across continual pretraining and fine-tuning. Experiments demonstrate that LLaVA-c consistently enhances standard benchmark performance and preserves general capabilities. For the first time, we show that task-by-task continual learning can achieve results that match or surpass multitask joint learning. The code will be publicly released.
CVMar 27, 2024
Towards Non-Exemplar Semi-Supervised Class-Incremental LearningWenzhuo Liu, Fei Zhu, Cheng-Lin Liu
Deep neural networks perform remarkably well in close-world scenarios. However, novel classes emerged continually in real applications, making it necessary to learn incrementally. Class-incremental learning (CIL) aims to gradually recognize new classes while maintaining the discriminability of old ones. Existing CIL methods have two limitations: a heavy reliance on preserving old data for forgetting mitigation and the need for vast labeled data for knowledge adaptation. To overcome these issues, we propose a non-exemplar semi-supervised CIL framework with contrastive learning and semi-supervised incremental prototype classifier (Semi-IPC). On the one hand, contrastive learning helps the model learn rich representations, easing the trade-off between learning representations of new classes and forgetting that of old classes. On the other hand, Semi-IPC learns a prototype for each class with unsupervised regularization, enabling the model to incrementally learn from partially labeled new data while maintaining the knowledge of old classes. Experiments on benchmark datasets demonstrate the strong performance of our method: without storing any old samples and only using less than 1% of labels, Semi-IPC outperforms advanced exemplar-based methods. We hope our work offers new insights for future CIL research. The code will be made publicly available.
LGMar 27, 2024
Branch-Tuning: Balancing Stability and Plasticity for Continual Self-Supervised LearningWenzhuo Liu, Fei Zhu, Cheng-Lin Liu
Self-supervised learning (SSL) has emerged as an effective paradigm for deriving general representations from vast amounts of unlabeled data. However, as real-world applications continually integrate new content, the high computational and resource demands of SSL necessitate continual learning rather than complete retraining. This poses a challenge in striking a balance between stability and plasticity when adapting to new information. In this paper, we employ Centered Kernel Alignment for quantitatively analyzing model stability and plasticity, revealing the critical roles of batch normalization layers for stability and convolutional layers for plasticity. Motivated by this, we propose Branch-tuning, an efficient and straightforward method that achieves a balance between stability and plasticity in continual SSL. Branch-tuning consists of branch expansion and compression, and can be easily applied to various SSL methods without the need of modifying the original methods, retaining old data or models. We validate our method through incremental experiments on various benchmark datasets, demonstrating its effectiveness and practical value in real-world scenarios. We hope our work offers new insights for future continual self-supervised learning research. The code will be made publicly available.
LGMar 30, 2025
Pareto Continual Learning: Preference-Conditioned Learning and Adaption for Dynamic Stability-Plasticity Trade-offSong Lai, Zhe Zhao, Fei Zhu et al.
Continual learning aims to learn multiple tasks sequentially. A key challenge in continual learning is balancing between two objectives: retaining knowledge from old tasks (stability) and adapting to new tasks (plasticity). Experience replay methods, which store and replay past data alongside new data, have become a widely adopted approach to mitigate catastrophic forgetting. However, these methods neglect the dynamic nature of the stability-plasticity trade-off and aim to find a fixed and unchanging balance, resulting in suboptimal adaptation during training and inference. In this paper, we propose Pareto Continual Learning (ParetoCL), a novel framework that reformulates the stability-plasticity trade-off in continual learning as a multi-objective optimization (MOO) problem. ParetoCL introduces a preference-conditioned model to efficiently learn a set of Pareto optimal solutions representing different trade-offs and enables dynamic adaptation during inference. From a generalization perspective, ParetoCL can be seen as an objective augmentation approach that learns from different objective combinations of stability and plasticity. Extensive experiments across multiple datasets and settings demonstrate that ParetoCL outperforms state-of-the-art methods and adapts to diverse continual learning scenarios.
CVMar 26, 2025
Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language ModelsFanhu Zeng, Zhen Cheng, Fei Zhu et al.
Reliable prediction by classifiers is crucial for their deployment in high security and dynamically changing situations. However, modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements obtained by existing methods on small-scale datasets, they all require training from scratch and there are no efficient and effective misclassification detection (MisD) methods, hindering practical application towards large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision language model (VLM) leveraging text information to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLM, we construct FSMisD, a Few-Shot prompt learning framework for MisD to refrain from training from scratch and therefore improve tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate the issue of overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvement demonstrates the effectiveness, efficiency and generalizability of our approach.
LGApr 20, 2025
TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution DataFei Zhu, Zhaoxiang Zhang
Reliable prediction is an essential requirement for deep neural models that are deployed in open environments, where both covariate and semantic out-of-distribution (OOD) data arise naturally. In practice, to make safe decisions, a reliable model should accept correctly recognized inputs while rejecting both those misclassified covariate-shifted and semantic-shifted examples. Besides, considering the potential existing trade-off between rejecting different failure cases, more convenient, controllable, and flexible failure detection approaches are needed. To meet the above requirements, we propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters and then integrating them, we can enhance the failure detection ability effectively and flexibly. Extensive experiments demonstrate the superiority of our framework.
LGFeb 2
Self-Consolidation for Self-Evolving AgentsHongzhuo Yu, Fei Zhu, Guo-Sen Xie et al.
While large language model (LLM) agents have demonstrated impressive problem-solving capabilities, they typically operate as static systems, lacking the ability to evolve through lifelong interaction. Existing attempts to bridge this gap primarily rely on retrieving successful past trajectories as demonstrations. However, this paradigm faces two critical limitations. First, by focusing solely on success, agents overlook the rich pedagogical value embedded in failed attempts, preventing them from identifying and avoiding recurrent pitfalls. Second, continually accumulating textual experiences not only increases the time consumption during retrieval but also inevitably introduces noise and exhausts the largest context window of current LLMs. To address these challenges, we propose a novel self-evolving framework for LLM agents that introduces a complementary evolution mechanism: First, a contrastive reflection strategy is introduced to explicitly summarize error-prone patterns and capture reusable insights. Second, we propose a self-consolidation mechanism that distills non-parametric textual experience into compact learnable parameters. This enables the agent to internalize extensive historical experience directly into its latent space. Extensive experiments demonstrate the advantages of our method in long-term agent evolution.
CVSep 15, 2025
A Fully Open and Generalizable Foundation Model for Ultrasound Clinical ApplicationsHongyuan Zhang, Yuheng Wu, Mingyang Zhao et al.
Artificial intelligence (AI) that can effectively learn ultrasound representations by integrating multi-source data holds significant promise for advancing clinical care. However, the scarcity of large labeled datasets in real-world clinical environments and the limited generalizability of task-specific models have hindered the development of generalizable clinical AI models for ultrasound applications. In this study, we present EchoCare, a novel ultrasound foundation model for generalist clinical use, developed via self-supervised learning on our curated, publicly available, large-scale dataset EchoCareData. EchoCareData comprises 4.5 million ultrasound images, sourced from over 23 countries across 5 continents and acquired via a diverse range of distinct imaging devices, thus encompassing global cohorts that are multi-center, multi-device, and multi-ethnic. Unlike prior studies that adopt off-the-shelf vision foundation model architectures, we introduce a hierarchical classifier into EchoCare to enable joint learning of pixel-level and representation-level features, capturing both global anatomical contexts and local ultrasound characteristics. With minimal training, EchoCare outperforms state-of-the-art comparison models across 10 representative ultrasound benchmarks of varying diagnostic difficulties, spanning disease diagnosis, lesion segmentation, organ detection, landmark prediction, quantitative regression, imaging enhancement and report generation. The code and pretrained model are publicly released, rendering EchoCare accessible for fine-tuning and local adaptation, supporting extensibility to additional applications. EchoCare provides a fully open and generalizable foundation model to boost the development of AI technologies for diverse clinical ultrasound applications.
LGApr 20, 2025
Semi-parametric Memory Consolidation: Towards Brain-like Deep Continual LearningGeng Liu, Fei Zhu, Rong Feng et al.
Humans and most animals inherently possess a distinctive capacity to continually acquire novel experiences and accumulate worldly knowledge over time. This ability, termed continual learning, is also critical for deep neural networks (DNNs) to adapt to the dynamically evolving world in open environments. However, DNNs notoriously suffer from catastrophic forgetting of previously learned knowledge when trained on sequential tasks. In this work, inspired by the interactive human memory and learning system, we propose a novel biomimetic continual learning framework that integrates semi-parametric memory and the wake-sleep consolidation mechanism. For the first time, our method enables deep neural networks to retain high performance on novel tasks while maintaining prior knowledge in real-world challenging continual learning scenarios, e.g., class-incremental learning on ImageNet. This study demonstrates that emulating biological intelligence provides a promising path to enable deep neural networks with continual learning capabilities.
LGMar 24, 2025
Global Convergence of Continual Learning on Non-IID DataFei Zhu, Yujing Liu, Wenzhuo Liu et al.
Continual learning, which aims to learn multiple tasks sequentially, has gained extensive attention. However, most existing work focuses on empirical studies, and the theoretical aspect remains under-explored. Recently, a few investigations have considered the theory of continual learning only for linear regressions, establishes the results based on the strict independent and identically distributed (i.i.d.) assumption and the persistent excitation on the feature data that may be difficult to verify or guarantee in practice. To overcome this fundamental limitation, in this paper, we provide a general and comprehensive theoretical analysis for continual learning of regression models. By utilizing the stochastic Lyapunov function and martingale estimation techniques, we establish the almost sure convergence results of continual learning under a general data condition for the first time. Additionally, without any excitation condition imposed on the data, the convergence rates for the forgetting and regret metrics are provided.
CVMar 5, 2025
DTU-Net: A Multi-Scale Dilated Transformer Network for Nonlinear Hyperspectral UnmixingChenTong Wang, Jincheng Gao, Fei Zhu et al.
Transformers have shown significant success in hyperspectral unmixing (HU). However, challenges remain. While multi-scale and long-range spatial correlations are essential in unmixing tasks, current Transformer-based unmixing networks, built on Vision Transformer (ViT) or Swin-Transformer, struggle to capture them effectively. Additionally, current Transformer-based unmixing networks rely on the linear mixing model, which lacks the flexibility to accommodate scenarios where nonlinear effects are significant. To address these limitations, we propose a multi-scale Dilated Transformer-based unmixing network for nonlinear HU (DTU-Net). The encoder employs two branches. The first one performs multi-scale spatial feature extraction using Multi-Scale Dilated Attention (MSDA) in the Dilated Transformer, which varies dilation rates across attention heads to capture long-range and multi-scale spatial correlations. The second one performs spectral feature extraction utilizing 3D-CNNs with channel attention. The outputs from both branches are then fused to integrate multi-scale spatial and spectral information, which is subsequently transformed to estimate the abundances. The decoder is designed to accommodate both linear and nonlinear mixing scenarios. Its interpretability is enhanced by explicitly modeling the relationships between endmembers, abundances, and nonlinear coefficients in accordance with the polynomial post-nonlinear mixing model (PPNMM). Experiments on synthetic and real datasets validate the effectiveness of the proposed DTU-Net compared to PPNMM-derived methods and several advanced unmixing networks.
CVMay 15, 2024
Fourier Boundary Features Network with Wider Catchers for Glass SegmentationXiaolin Qin, Jiacen Liu, Qianlei Wang et al.
Glass largely blurs the boundary between the real world and the reflection. The special transmittance and reflectance quality have confused the semantic tasks related to machine vision. Therefore, how to clear the boundary built by glass, and avoid over-capturing features as false positive information in deep structure, matters for constraining the segmentation of reflection surface and penetrating glass. We proposed the Fourier Boundary Features Network with Wider Catchers (FBWC), which might be the first attempt to utilize sufficiently wide horizontal shallow branches without vertical deepening for guiding the fine granularity segmentation boundary through primary glass semantic information. Specifically, we designed the Wider Coarse-Catchers (WCC) for anchoring large area segmentation and reducing excessive extraction from a structural perspective. We embed fine-grained features by Cross Transpose Attention (CTA), which is introduced to avoid the incomplete area within the boundary caused by reflection noise. For excavating glass features and balancing high-low layers context, a learnable Fourier Convolution Controller (FCC) is proposed to regulate information integration robustly. The proposed method has been validated on three different public glass segmentation datasets. Experimental results reveal that the proposed method yields better segmentation performance compared with the state-of-the-art (SOTA) methods in glass image segmentation.
CVMar 27, 2024
Multi-scale Unified Network for Image ClassificationWenzhuo Liu, Fei Zhu, Cheng-Lin Liu
Convolutional Neural Networks (CNNs) have advanced significantly in visual representation learning and recognition. However, they face notable challenges in performance and computational efficiency when dealing with real-world, multi-scale image inputs. Conventional methods rescale all input images into a fixed size, wherein a larger fixed size favors performance but rescaling small size images to a larger size incurs digitization noise and increased computation cost. In this work, we carry out a comprehensive, layer-wise investigation of CNN models in response to scale variation, based on Centered Kernel Alignment (CKA) analysis. The observations reveal lower layers are more sensitive to input image scale variations than high-level layers. Inspired by this insight, we propose Multi-scale Unified Network (MUSN) consisting of multi-scale subnets, a unified network, and scale-invariant constraint. Our method divides the shallow layers into multi-scale subnets to enable feature extraction from multi-scale inputs, and the low-level features are unified in deep layers for extracting high-level semantic features. A scale-invariant constraint is posed to maintain feature consistency across different scales. Extensive experiments on ImageNet and other scale-diverse datasets, demonstrate that MSUN achieves significant improvements in both model performance and computational efficiency. Particularly, MSUN yields an accuracy increase up to 44.53% and diminishes FLOPs by 7.01-16.13% in multi-scale scenarios.
CVApr 1, 2020
Improving Deep Hyperspectral Image Classification Performance with Spectral UnmixingAlan J. X. Guo, Fei Zhu
Recent advances in neural networks have made great progress in the hyperspectral image (HSI) classification. However, the overfitting effect, which is mainly caused by complicated model structure and small training set, remains a major concern. Reducing the complexity of the neural networks could prevent overfitting to some extent, but also declines the networks' ability to express more abstract features. Enlarging the training set is also difficult, for the high expense of acquisition and manual labeling. In this paper, we propose an abundance-based multi-HSI classification method. Firstly, we convert every HSI from the spectral domain to the abundance domain by a dataset-specific autoencoder. Secondly, the abundance representations from multiple HSIs are collected to form an enlarged dataset. Lastly, we train an abundance-based classifier and employ the classifier to predict over all the involved HSI datasets. Different from the spectra that are usually highly mixed, the abundance features are more representative in reduced dimension with less noise. This benefits the proposed method to employ simple classifiers and enlarged training data, and to expect less overfitting issues. The effectiveness of the proposed method is verified by the ablation study and the comparative experiments.
CVMar 4, 2019
Hyperspectral Image Classification with Deep Metric Learning and Conditional Random FieldYi Liang, Xin Zhao, Alan J. X. Guo et al.
To improve the classification performance in the context of hyperspectral image processing, many works have been developed based on two common strategies, namely the spatial-spectral information integration and the utilization of neural networks. However, both strategies typically require more training data than the classical algorithms, aggregating the shortage of labeled samples. In this letter, we propose a novel framework that organically combines the spectrum-based deep metric learning model and the conditional random field algorithm. The deep metric learning model is supervised by the center loss to produce spectrum-based features that gather more tightly in Euclidean space within classes. The conditional random field with Gaussian edge potentials, which is firstly proposed for image segmentation tasks, is introduced to give the pixel-wise classification over the hyperspectral image by utilizing both the geographical distances between pixels and the Euclidean distances between the features produced by the deep metric learning model. The proposed framework is trained by spectral pixels at the deep metric learning stage and utilizes the half handcrafted spatial features at the conditional random field stage. This settlement alleviates the shortage of training data to some extent. Experiments on two real hyperspectral images demonstrate the advantages of the proposed method in terms of both classification accuracy and computation cost.
CVJan 31, 2018
A CNN-based Spatial Feature Fusion Algorithm for Hyperspectral Imagery ClassificationAlan J. X. Guo, Fei Zhu
The shortage of training samples remains one of the main obstacles in applying the artificial neural networks (ANN) to the hyperspectral images classification. To fuse the spatial and spectral information, pixel patches are often utilized to train a model, which may further aggregate this problem. In the existing works, an ANN model supervised by center-loss (ANNC) was introduced. Training merely with spectral information, the ANNC yields discriminative spectral features suitable for the subsequent classification tasks. In this paper, a CNN-based spatial feature fusion (CSFF) algorithm is proposed, which allows a smart fusion of the spatial information to the spectral features extracted by ANNC. As a critical part of CSFF, a CNN-based discriminant model is introduced to estimate whether two paring pixels belong to the same class. At the testing stage, by applying the discriminant model to the pixel-pairs generated by the test pixel and its neighbors, the local structure is estimated and represented as a customized convolutional kernel. The spectral-spatial feature is obtained by a convolutional operation between the estimated kernel and the corresponding spectral features within a neighborhood. At last, the label of the test pixel is predicted by classifying the resulting spectral-spatial feature. Without increasing the number of training samples or involving pixel patches at the training stage, the CSFF framework achieves the state-of-the-art by declining $20\%-50\%$ classification failures in experiments on three well-known hyperspectral images.
CVNov 20, 2017
Spectral-Spatial Feature Extraction and Classification by ANN Supervised with Center Loss in Hyperspectral ImageryAlan J. X. Guo, Fei Zhu
In this paper, we propose a spectral-spatial feature extraction and classification framework based on artificial neuron network (ANN) in the context of hyperspectral imagery. With limited labeled samples, only spectral information is exploited for training and spatial context is integrated posteriorly at the testing stage. Taking advantage of recent advances in face recognition, a joint supervision symbol that combines softmax loss and center loss is adopted to train the proposed network, by which intra-class features are gathered while inter-class variations are enlarged. Based on the learned architecture, the extracted spectrum-based features are classified by a center classifier. Moreover, to fuse the spectral and spatial information, an adaptive spectral-spatial center classifier is developed, where multiscale neighborhoods are considered simultaneously, and the final label is determined using an adaptive voting strategy. Finally, experimental results on three well-known datasets validate the effectiveness of the proposed methods compared with the state-of-the-art approaches.