Xu-Yao Zhang

CV
h-index34
54papers
2,624citations
Novelty49%
AI Score62

54 Papers

LGMar 6, 2023Code
Rethinking Confidence Calibration for Failure Prediction

Fei Zhu, Zhen Cheng, Xu-Yao Zhang et al.

Reliable confidence estimation for the predictions is important in many safety-critical applications. However, modern deep neural networks are often overconfident for their incorrect predictions. Recently, many calibration methods have been proposed to alleviate the overconfidence problem. With calibrated confidence, a primary and practical purpose is to detect misclassification errors by filtering out low-confidence predictions (known as failure prediction). In this paper, we find a general, widely-existed but actually-neglected phenomenon that most confidence calibration methods are useless or harmful for failure prediction. We investigate this problem and reveal that popular confidence calibration methods often lead to worse confidence separation between correct and incorrect samples, making it more difficult to decide whether to trust a prediction or not. Finally, inspired by the natural connection between flat minima and confidence separation, we propose a simple hypothesis: flat minima is beneficial for failure prediction. We verify this hypothesis via extensive experiments and further boost the performance by combining two different flat minima techniques. Our code is available at https://github.com/Impression2805/FMFP

LGMar 30, 2023Code
OpenMix: Exploring Outlier Samples for Misclassification Detection

Fei Zhu, Zhen Cheng, Xu-Yao Zhang et al.

Reliable confidence estimation for deep neural classifiers is a challenging yet fundamental requirement in high-stakes applications. Unfortunately, modern deep neural networks are often overconfident for their erroneous predictions. In this work, we exploit the easily available outlier samples, i.e., unlabeled samples coming from non-target classes, for helping detect misclassification errors. Particularly, we find that the well-known Outlier Exposure, which is powerful in detecting out-of-distribution (OOD) samples from unknown classes, does not provide any gain in identifying misclassification errors. Based on these observations, we propose a novel method called OpenMix, which incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation. OpenMix significantly improves confidence reliability under various scenarios, establishing a strong and unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes. The code is publicly available at https://github.com/Impression2805/OpenMix.

CVMar 20, 2022Code
Document Dewarping with Control Points

Guo-Wang Xie, Fei Yin, Xu-Yao Zhang et al.

Document images are now widely captured by handheld devices such as mobile phones. The OCR performance on these images are largely affected due to geometric distortion of the document paper, diverse camera positions and complex backgrounds. In this paper, we propose a simple yet effective approach to rectify distorted document image by estimating control points and reference points. After that, we use interpolation method between control points and reference points to convert sparse mappings to backward mapping, and remap the original distorted document image to the rectified image. Furthermore, control points are controllable to facilitate interaction or subsequent adjustment. We can flexibly select post-processing methods and the number of vertices according to different application scenarios. Experiments show that our approach can rectify document images with various distortion types, and yield state-of-the-art performance on real-world dataset. This paper also provides a training dataset based on control points for document dewarping. Both the code and the dataset are released at https://github.com/gwxie/Document-Dewarping-with-Control-Points.

LGMar 21, 2023Code
Dynamics-Aware Loss for Learning with Label Noise

Xiu-Chuan Li, Xiaobo Xia, Fei Zhu et al.

Label noise poses a serious threat to deep neural networks (DNNs). Employing robust loss functions which reconcile fitting ability with robustness is a simple but effective strategy to handle this problem. However, the widely-used static trade-off between these two factors contradicts the dynamics of DNNs learning with label noise, leading to inferior performance. Therefore, we propose a dynamics-aware loss (DAL) to solve this problem. Considering that DNNs tend to first learn beneficial patterns, then gradually overfit harmful label noise, DAL strengthens the fitting ability initially, then gradually improves robustness. Moreover, at the later stage, to further reduce the negative impact of label noise and combat underfitting simultaneously, we let DNNs put more emphasis on easy examples than hard ones and introduce a bootstrapping term. Both the detailed theoretical analyses and extensive experimental results demonstrate the superiority of our method. Our source code can be found in https://github.com/XiuchuanLi/DAL.

LGJul 18, 2023Code
Towards Trustworthy Dataset Distillation

Shijie Ma, Fei Zhu, Zhen Cheng et al.

Efficiency and trustworthiness are two eternal pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset. However, existing methods merely concentrate on in-distribution (InD) classification in a closed-world setting, disregarding out-of-distribution (OOD) samples. On the other hand, OOD detection aims to enhance models' trustworthiness, which is always inefficiently achieved in full-data settings. For the first time, we simultaneously consider both issues and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets are capable of training models competent in both InD classification and OOD detection. To alleviate the requirement of real outlier data, we further propose to corrupt InD samples to generate pseudo-outliers, namely Pseudo-Outlier Exposure (POE). Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and POE surpasses the state-of-the-art method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more trustworthy and applicable to open-world scenarios. Our code is available at https://github.com/mashijie1028/TrustDD

CVSep 7, 2024Code
Local-Prompt: Extensible Local Prompts for Few-Shot Out-of-Distribution Detection

Fanhu Zeng, Zhen Cheng, Fei Zhu et al.

Out-of-Distribution (OOD) detection, aiming to distinguish outliers from known categories, has gained prominence in practical scenarios. Recently, the advent of vision-language models (VLM) has heightened interest in enhancing OOD detection for VLM through few-shot tuning. However, existing methods mainly focus on optimizing global prompts, ignoring refined utilization of local information with regard to outliers. Motivated by this, we freeze global prompts and introduce Local-Prompt, a novel coarse-to-fine tuning paradigm to emphasize regional enhancement with local prompts. Our method comprises two integral components: global prompt guided negative augmentation and local prompt enhanced regional regularization. The former utilizes frozen, coarse global prompts as guiding cues to incorporate negative augmentation, thereby leveraging local outlier knowledge. The latter employs trainable local prompts and a regional regularization to capture local information effectively, aiding in outlier identification. We also propose regional-related metric to empower the enrichment of OOD detection. Moreover, since our approach explores enhancing local prompts only, it can be seamlessly integrated with trained global prompts during inference to boost the performance. Comprehensive experiments demonstrate the effectiveness and potential of our method. Notably, our method reduces average FPR95 by 5.17% against state-of-the-art method in 4-shot tuning on challenging ImageNet-1k dataset, even outperforming 16-shot results of previous methods. Code is released at https://github.com/AuroraZengfh/Local-Prompt.

CVAug 29, 2024Code
Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

Zhengqing Gao, Xiang Ao, Xu-Yao Zhang et al.

Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully explore the knowledge of text modality, demonstrating strong zero-shot recognition performance, which is naturally suited for various open-set problems. More recently, some research focuses on fine-tuning such models to downstream tasks. Prompt tuning methods achieved huge improvements by learning context vectors on few-shot data. However, through the evaluation under open-set adaptation setting with the test data including new classes, we find that there exists a dilemma that learned prompts have worse generalization abilities than hand-crafted prompts. In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach, which leverages the maximum concept matching (MCM) scores as dynamic weights to generate an input-conditioned prompt for each image during test. Through extensive experiments on 11 different datasets, we show that our proposed method outperforms all comparison methods on average considering both base and new classes. The code is available at https://github.com/gaozhengqing/TTPT

CVMar 26, 2022
A Survey of Robust Adversarial Training in Pattern Recognition: Fundamental, Theory, and Methodologies

Zhuang Qian, Kaizhu Huang, Qiu-Feng Wang et al.

In the last a few decades, deep neural networks have achieved remarkable success in machine learning, computer vision, and pattern recognition. Recent studies however show that neural networks (both shallow and deep) may be easily fooled by certain imperceptibly perturbed input samples called adversarial examples. Such security vulnerability has resulted in a large body of research in recent years because real-world threats could be introduced due to vast applications of neural networks. To address the robustness issue to adversarial examples particularly in pattern recognition, robust adversarial training has become one mainstream. Various ideas, methods, and applications have boomed in the field. Yet, a deep understanding of adversarial training including characteristics, interpretations, theories, and connections among different models has still remained elusive. In this paper, we present a comprehensive survey trying to offer a systematic and structured investigation on robust adversarial training in pattern recognition. We start with fundamentals including definition, notations, and properties of adversarial examples. We then introduce a unified theoretical framework for defending against adversarial samples - robust adversarial training with visualizations and interpretations on why adversarial training can lead to model robustness. Connections will be also established between adversarial training and other traditional learning theories. After that, we summarize, review, and discuss various methodologies with adversarial attack and defense/training algorithms in a structured way. Finally, we present analysis, outlook, and remarks of adversarial training.

LGMar 2, 2023
Average of Pruning: Improving Performance and Stability of Out-of-Distribution Detection

Zhen Cheng, Fei Zhu, Xu-Yao Zhang et al.

Detecting Out-of-distribution (OOD) inputs have been a critical issue for neural networks in the open world. However, the unstable behavior of OOD detection along the optimization trajectory during training has not been explored clearly. In this paper, we first find the performance of OOD detection suffers from overfitting and instability during training: 1) the performance could decrease when the training error is near zero, and 2) the performance would vary sharply in the final stage of training. Based on our findings, we propose Average of Pruning (AoP), consisting of model averaging and pruning, to mitigate the unstable behaviors. Specifically, model averaging can help achieve a stable performance by smoothing the landscape, and pruning is certified to eliminate the overfitting by eliminating redundant features. Comprehensive experiments on various datasets and architectures are conducted to verify the effectiveness of our method.

CVSep 2, 2024
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Yi Chen, Jian Xu, Xu-Yao Zhang et al.

With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a text information-guided dynamic visual token recovery mechanism that does not require training. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity. Our source code will be made publicly available following acceptance.

CVNov 22, 2023
Unified Classification and Rejection: A One-versus-All Framework

Zhen Cheng, Xu-Yao Zhang, Cheng-Lin Liu

Classifying patterns of known classes and rejecting ambiguous and novel (also called as out-of-distribution (OOD)) inputs are involved in open world pattern recognition. Deep neural network models usually excel in closed-set classification while performs poorly in rejecting OOD inputs. To tackle this problem, numerous methods have been designed to perform open set recognition (OSR) or OOD rejection/detection tasks. Previous methods mostly take post-training score transformation or hybrid models to ensure low scores on OOD inputs while separating known classes. In this paper, we attempt to build a unified framework for building open set classifiers for both classification and OOD rejection. We formulate the open set recognition of $ K $-known-class as a $ (K+1) $-class classification problem with model trained on known-class samples only. By decomposing the $ K $-class problem into $ K $ one-versus-all (OVA) binary classification tasks and binding some parameters, we show that combining the scores of OVA classifiers can give $ (K+1) $-class posterior probabilities, which enables classification and OOD rejection in a unified framework. To maintain the closed-set classification accuracy of the OVA trained classifier, we propose a hybrid training strategy combining OVA loss and multi-class cross-entropy loss. We implement the OVA framework and hybrid training strategy on the recently proposed convolutional prototype network and prototype classifier on vision transformer (ViT) backbone. Experiments on popular OSR and OOD detection datasets demonstrate that the proposed framework, using a single multi-class classifier, yields competitive performance in closed-set classification, OOD detection, and misclassification detection.

CVJul 19, 2024
PASS++: A Dual Bias Reduction Framework for Non-Exemplar Class-Incremental Learning

Fei Zhu, Xu-Yao Zhang, Zhen Cheng et al.

Class-incremental learning (CIL) aims to recognize new classes incrementally while maintaining the discriminability of old classes. Most existing CIL methods are exemplar-based, i.e., storing a part of old data for retraining. Without relearning old data, those methods suffer from catastrophic forgetting. In this paper, we figure out two inherent problems in CIL, i.e., representation bias and classifier bias, that cause catastrophic forgetting of old knowledge. To address these two biases, we present a simple and novel dual bias reduction framework that employs self-supervised transformation (SST) in input space and prototype augmentation (protoAug) in deep feature space. On the one hand, SST alleviates the representation bias by learning generic and diverse representations that can transfer across different tasks. On the other hand, protoAug overcomes the classifier bias by explicitly or implicitly augmenting prototypes of old classes in the deep feature space, which poses tighter constraints to maintain previously learned decision boundaries. We further propose hardness-aware prototype augmentation and multi-view ensemble strategies, leading to significant improvements. The proposed framework can be easily integrated with pre-trained models. Without storing any samples of old classes, our method can perform comparably with state-of-the-art exemplar-based approaches which store plenty of old data. We hope to draw the attention of researchers back to non-exemplar CIL by rethinking the necessity of storing old samples in CIL.

67.0CVMar 18Code
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

Ziwei Xiang, Fanhu Zeng, Hongjian Fang et al.

Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.

CVJan 28Code
Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective

Qiyan Zhao, Xiaofeng Zhang, Shuochen Chang et al.

Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the \textbf{Repeat Curse}. To better investigate underlying mechanism behind this issue, we analyze repetition generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model's growing prediction certainty; (3) Repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present \textbf{CoTA}, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA

AIDec 18, 2025
PDE-Agent: A toolchain-augmented multi-agent framework for PDE solving

Jianming Liu, Ren Zhu, Jian Xu et al.

Solving Partial Differential Equations (PDEs) is a cornerstone of engineering and scientific research. Traditional methods for PDE solving are cumbersome, relying on manual setup and domain expertise. While Physics-Informed Neural Network (PINNs) introduced end-to-end neural network-based solutions, and frameworks like DeepXDE further enhanced automation, these approaches still depend on expert knowledge and lack full autonomy. In this work, we frame PDE solving as tool invocation via LLM-driven agents and introduce PDE-Agent, the first toolchain-augmented multi-agent collaboration framework, inheriting the reasoning capacity of LLMs and the controllability of external tools and enabling automated PDE solving from natural language descriptions. PDE-Agent leverages the strengths of multi-agent and multi-tool collaboration through two key innovations: (1) A Prog-Act framework with graph memory for multi-agent collaboration, which enables effective dynamic planning and error correction via dual-loop mechanisms (localized fixes and global revisions). (2) A Resource-Pool integrated with a tool-parameter separation mechanism for multi-tool collaboration. This centralizes the management of runtime artifacts and resolves inter-tool dependency gaps in existing frameworks. To validate and evaluate this new paradigm for PDE solving , we develop PDE-Bench, a multi-type PDE Benchmark for agent-based tool collaborative solving, and propose multi-level metrics for assessing tool coordination. Evaluations verify that PDE-Agent exhibits superior applicability and performance in complex multi-step, cross-step dependent tasks. This new paradigm of toolchain-augmented multi-agent PDE solving will further advance future developments in automated scientific computing. Our source code and dataset will be made publicly available.

CVAug 17, 2024
StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Jiao Zhang, Jian Xu, Xu-Yao Zhang et al.

In real-world applications, the sample distribution at the inference stage often differs from the one at the training stage, causing performance degradation of trained deep models. The research on domain generalization (DG) aims to develop robust algorithms that can improve the generalized performance in unseen domains by training on a few domains. However, the domain-agnostic vision model, trained on a limited number of domains using traditional domain generalization methods, cannot guarantee its effectiveness in dealing with unseen domains. The introduction of language can break the closed cognition space of the vision model, providing additional semantic information that cannot be inferred from vision-only datasets. In this paper, we propose to overcome the challenge in previous DG methods by introducing the style prompt in the language modality to adapt the trained model dynamically. In particular, we train a style prompter to extract style information of the current image into an embedding in the token embedding space and place it in front of the candidate category words as prior knowledge to prompt the model. Our open space partition of the style token embedding space and the hand-crafted style regularization enable the trained style prompter to handle data from unknown domains effectively. Extensive experiments verify the effectiveness of our method and demonstrate state-of-the-art performances on multiple public datasets. Codes will be available after the acceptance of this paper.

CVSep 12, 2023
Towards Reliable Domain Generalization: A New Dataset and Evaluations

Jiao Zhang, Xu-Yao Zhang, Cheng-Lin Liu

There are ubiquitous distribution shifts in the real world. However, deep neural networks (DNNs) are easily biased towards the training set, which causes severe performance degradation when they receive out-of-distribution data. Many methods are studied to train models that generalize under various distribution shifts in the literature of domain generalization (DG). However, the recent DomainBed and WILDS benchmarks challenged the effectiveness of these methods. Aiming at the problems in the existing research, we propose a new domain generalization task for handwritten Chinese character recognition (HCCR) to enrich the application scenarios of DG method research. We evaluate eighteen DG methods on the proposed PaHCC (Printed and Handwritten Chinese Characters) dataset and show that the performance of existing methods on this dataset is still unsatisfactory. Besides, under a designed dynamic DG setting, we reveal more properties of DG methods and argue that only the leave-one-domain-out protocol is unreliable. We advocate that researchers in the DG community refer to dynamic performance of methods for more comprehensive and reliable evaluation. Our dataset and evaluations bring new perspectives to the community for more substantial progress. We will make our dataset public with the article published to facilitate the study of domain generalization.

AINov 11, 2025
SciAgent: A Unified Multi-Agent System for Generalistic Scientific Reasoning

Xuchen Li, Ruitao Wu, Xuanbo Liu et al.

Recent advances in large language models have enabled AI systems to achieve expert-level performance on domain-specific scientific tasks, yet these systems remain narrow and handcrafted. We introduce SciAgent, a unified multi-agent system designed for generalistic scientific reasoning-the ability to adapt reasoning strategies across disciplines and difficulty levels. SciAgent organizes problem solving as a hierarchical process: a Coordinator Agent interprets each problem's domain and complexity, dynamically orchestrating specialized Worker Systems, each composed of interacting reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification. These agents collaboratively assemble and refine reasoning pipelines tailored to each task. Across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), SciAgent consistently attains or surpasses human gold-medalist performance, demonstrating both domain generality and reasoning adaptability. Additionally, SciAgent has been tested on the International Chemistry Olympiad (IChO) and selected problems from the Humanity's Last Exam (HLE) benchmark, further confirming the system's ability to generalize across diverse scientific domains. This work establishes SciAgent as a concrete step toward generalistic scientific intelligence-AI systems capable of coherent, cross-disciplinary reasoning at expert levels.

CVMar 7, 2024Code
Active Generalized Category Discovery

Shijie Ma, Fei Zhu, Zhun Zhong et al.

Generalized Category Discovery (GCD) is a pragmatic and challenging open-world task, which endeavors to cluster unlabeled samples from both novel and old classes, leveraging some labeled data of old classes. Given that knowledge learned from old classes is not fully transferable to new classes, and that novel categories are fully unlabeled, GCD inherently faces intractable problems, including imbalanced classification performance and inconsistent confidence between old and new classes, especially in the low-labeling regime. Hence, some annotations of new classes are deemed necessary. However, labeling new classes is extremely costly. To address this issue, we take the spirit of active learning and propose a new setting called Active Generalized Category Discovery (AGCD). The goal is to improve the performance of GCD by actively selecting a limited amount of valuable samples for labeling from the oracle. To solve this problem, we devise an adaptive sampling strategy, which jointly considers novelty, informativeness and diversity to adaptively select novel samples with proper uncertainty. However, owing to the varied orderings of label indices caused by the clustering of novel classes, the queried labels are not directly applicable to subsequent training. To overcome this issue, we further propose a stable label mapping algorithm that transforms ground truth labels to the label space of the classifier, thereby ensuring consistent training across different active selection stages. Our method achieves state-of-the-art performance on both generic and fine-grained datasets. Our code is available at https://github.com/mashijie1028/ActiveGCD

CVJan 4, 2024Code
PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning

Haiyang Guo, Fei Zhu, Wenzhuo Liu et al.

Existing federated learning methods have effectively dealt with decentralized learning in scenarios involving data privacy and non-IID data. However, in real-world situations, each client dynamically learns new classes, requiring the global model to classify all seen classes. To effectively mitigate catastrophic forgetting and data heterogeneity under low communication costs, we propose a simple and effective method named PILoRA. On the one hand, we adopt prototype learning to learn better feature representations and leverage the heuristic information between prototypes and class features to design a prototype re-weight module to solve the classifier bias caused by data heterogeneity without retraining the classifier. On the other hand, we view incremental learning as the process of learning distinct task vectors and encoding them within different LoRA parameters. Accordingly, we propose Incremental LoRA to mitigate catastrophic forgetting. Experimental results on standard datasets indicate that our method outperforms the state-of-the-art approaches significantly. More importantly, our method exhibits strong robustness and superiority in different settings and degrees of data heterogeneity. The code is available at \url{https://github.com/Ghy0501/PILoRA}.

CVMar 5, 2024Code
Revisiting Confidence Estimation: Towards Reliable Failure Prediction

Fei Zhu, Xu-Yao Zhang, Zhen Cheng et al.

Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction. The code is available at \url{https://github.com/Impression2805/FMFP}.

CVApr 9, 2024Code
Unified Entropy Optimization for Open-Set Test-Time Adaptation

Zhengqing Gao, Xu-Yao Zhang, Cheng-Lin Liu

Test-time adaptation (TTA) aims at adapting a model pre-trained on the labeled source domain to the unlabeled target domain. Existing methods usually focus on improving TTA performance under covariate shifts, while neglecting semantic shifts. In this paper, we delve into a realistic open-set TTA setting where the target domain may contain samples from unknown classes. Many state-of-the-art closed-set TTA methods perform poorly when applied to open-set scenarios, which can be attributed to the inaccurate estimation of data distribution and model confidence. To address these issues, we propose a simple but effective framework called unified entropy optimization (UniEnt), which is capable of simultaneously adapting to covariate-shifted in-distribution (csID) data and detecting covariate-shifted out-of-distribution (csOOD) data. Specifically, UniEnt first mines pseudo-csID and pseudo-csOOD samples from test data, followed by entropy minimization on the pseudo-csID data and entropy maximization on the pseudo-csOOD data. Furthermore, we introduce UniEnt+ to alleviate the noise caused by hard data partition leveraging sample-level confidence. Extensive experiments on CIFAR benchmarks and Tiny-ImageNet-C show the superiority of our framework. The code is available at https://github.com/gaozhengqing/UniEnt

LGApr 2, 2025Code
ProtoGCD: Unified and Unbiased Prototype Learning for Generalized Category Discovery

Shijie Ma, Fei Zhu, Xu-Yao Zhang et al.

Generalized category discovery (GCD) is a pragmatic but underexplored problem, which requires models to automatically cluster and discover novel categories by leveraging the labeled samples from old classes. The challenge is that unlabeled data contain both old and new classes. Early works leveraging pseudo-labeling with parametric classifiers handle old and new classes separately, which brings about imbalanced accuracy between them. Recent methods employing contrastive learning neglect potential positives and are decoupled from the clustering objective, leading to biased representations and sub-optimal results. To address these issues, we introduce a unified and unbiased prototype learning framework, namely ProtoGCD, wherein old and new classes are modeled with joint prototypes and unified learning objectives, {enabling unified modeling between old and new classes}. Specifically, we propose a dual-level adaptive pseudo-labeling mechanism to mitigate confirmation bias, together with two regularization terms to collectively help learn more suitable representations for GCD. Moreover, for practical considerations, we devise a criterion to estimate the number of new classes. Furthermore, we extend ProtoGCD to detect unseen outliers, achieving task-level unification. Comprehensive experiments show that ProtoGCD achieves state-of-the-art performance on both generic and fine-grained datasets. The code is available at https://github.com/mashijie1028/ProtoGCD.

CLMar 17, 2025Code
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

Haiyang Guo, Fanhu Zeng, Ziwei Xiang et al.

Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.

LGMar 17, 2025Code
Federated Continual Instruction Tuning

Haiyang Guo, Fanhu Zeng, Fei Zhu et al.

A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Code and dataset are released at https://github.com/Ghy0501/FCIT.

LGNov 28, 2024Code
DESIRE: Dynamic Knowledge Consolidation for Rehearsal-Free Continual Learning

Haiyang Guo, Fei Zhu, Fanhu Zeng et al.

Continual learning aims to equip models with the ability to retain previously learned knowledge like a human. Recent work incorporating Parameter-Efficient Fine-Tuning has revitalized the field by introducing lightweight extension modules. However, existing methods usually overlook the issue of information leakage caused by the fact that the experiment data have been used in pre-trained models. Once these duplicate data are removed in the pre-training phase, their performance can be severely affected. In this paper, we propose a new LoRA-based rehearsal-free method named DESIRE. Our method avoids imposing additional constraints during training to mitigate catastrophic forgetting, thereby maximizing the learning of new classes. To integrate knowledge from old and new tasks, we propose two efficient post-processing modules. On the one hand, we retain only two sets of LoRA parameters for merging and propose dynamic representation consolidation to calibrate the merged feature representation. On the other hand, we propose decision boundary refinement to address classifier bias when training solely on new class data. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple datasets and strikes an effective balance between stability and plasticity. Our code will be publicly available.

LGJun 16, 2025Code
Continual Learning for Generative AI: From LLMs to MLLMs and Beyond

Haiyang Guo, Fanhu Zeng, Fei Zhu et al.

The rapid advancement of generative models has empowered modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models are fundamentally constrained by \emph{catastrophic forgetting}, \ie~a persistent challenge where models experience performance degradation on previously learned tasks when adapting to new tasks. To address this practical limitation, numerous approaches have been proposed to enhance the adaptability and scalability of generative AI in real-world applications. In this work, we present a comprehensive survey of continual learning methods for mainstream generative AI models, encompassing large language models, multimodal large language models, vision-language-action models, and diffusion models. Drawing inspiration from the memory mechanisms of the human brain, we systematically categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based methods, while elucidating their underlying methodologies and motivations. We further analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones, thereby providing deeper insights into the field. The project page of this paper is available at https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models.

CLJan 13
Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models

Rongji Li, Jian Xu, Xueqing Chen et al.

In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.

CVAug 10, 2025Code
MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark

Haiyang Guo, Fei Zhu, Hongbo Zhao et al.

Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib.

75.9CVApr 1
CL-VISTA: Benchmarking Continual Learning in Video Large Language Models

Haiyang Guo, Yichen Shi, Fei Zhu et al.

Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to assess whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.

AIOct 16, 2025Code
MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

Xukai Wang, Xuanbo Liu, Mingrui Chen et al.

With the advancement of powerful large-scale reasoning models, effectively evaluating the reasoning capabilities of these models has become increasingly important. However, existing benchmarks designed to assess the reasoning abilities of large models tend to be limited in scope and lack the flexibility to adapt their difficulty according to the evolving reasoning capacities of the models. To address this, we propose MorphoBench, a benchmark that incorporates multidisciplinary questions to evaluate the reasoning capabilities of large models and can adjust and update question difficulty based on the reasoning abilities of advanced models. Specifically, we curate the benchmark by selecting and collecting complex reasoning questions from existing benchmarks and sources such as Olympiad-level competitions. Additionally, MorphoBench adaptively modifies the analytical challenge of questions by leveraging key statements generated during the model's reasoning process. Furthermore, it includes questions generated using simulation software, enabling dynamic adjustment of benchmark difficulty with minimal resource consumption. We have gathered over 1,300 test questions and iteratively adjusted the difficulty of MorphoBench based on the reasoning capabilities of models such as o3 and GPT-5. MorphoBench enhances the comprehensiveness and validity of model reasoning evaluation, providing reliable guidance for improving both the reasoning abilities and scientific robustness of large models. The code has been released in https://github.com/OpenDCAI/MorphoBench.

LGApr 7, 2025Code
Achieving binary weight and activation for LLMs using Post-Training Quantization

Siqing Song, Chuang Wang, Ruiqi Wang et al.

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grain grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we decompose INT4-quantized activations into a 4 * INT1 format equivalently and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.

LGMar 4, 2024
Open-world machine learning: A review and new outlooks

Fei Zhu, Shijie Ma, Zhen Cheng et al.

Machine learning has achieved remarkable success in many applications. However, existing studies are largely based on the closed-world assumption, which assumes that the environment is stationary, and the model is fixed once deployed. In many real-world applications, this fundamental and rather naive assumption may not hold because an open environment is complex, dynamic, and full of unknowns. In such cases, rejecting unknowns, discovering novelties, and then continually learning them, could enable models to be safe and evolve continually as biological systems do. This article presents a holistic view of open-world machine learning by investigating unknown rejection, novelty discovery, and continual learning in a unified paradigm. The challenges, principles, and limitations of current methodologies are discussed in detail. Furthermore, widely used benchmarks, metrics, and performances are summarized. Finally, we discuss several potential directions for further progress in the field. By providing a comprehensive introduction to the emerging open-world machine learning paradigm, this article aims to help researchers build more powerful AI systems in their respective fields, and to promote the development of artificial general intelligence.

CVMar 11, 2024
Ensemble Quadratic Assignment Network for Graph Matching

Haoru Tan, Chuang Wang, Sitong Wu et al.

Graph matching is a commonly used technique in computer vision and pattern recognition. Recent data-driven approaches have improved the graph matching accuracy remarkably, whereas some traditional algorithm-based methods are more robust to feature noises, outlier nodes, and global transformation (e.g.~rotation). In this paper, we propose a graph neural network (GNN) based approach to combine the advantages of data-driven and traditional methods. In the GNN framework, we transform traditional graph-matching solvers as single-channel GNNs on the association graph and extend the single-channel architecture to the multi-channel network. The proposed model can be seen as an ensemble method that fuses multiple algorithms at every iteration. Instead of averaging the estimates at the end of the ensemble, in our approach, the independent iterations of the ensembled algorithms exchange their information after each iteration via a 1x1 channel-wise convolution layer. Experiments show that our model improves the performance of traditional algorithms significantly. In addition, we propose a random sampling strategy to reduce the computational complexity and GPU memory usage, so the model applies to matching graphs with thousands of nodes. We evaluate the performance of our method on three tasks: geometric graph matching, semantic feature matching, and few-shot 3D shape classification. The proposed model performs comparably or outperforms the best existing GNN-based methods.

85.2LGApr 21
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

Siqing Song, Chuang Wang, Yong Lang et al.

Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) training learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained only using 0.016B tokens with a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.

CVMar 26, 2025
Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models

Fanhu Zeng, Zhen Cheng, Fei Zhu et al.

Reliable prediction by classifiers is crucial for their deployment in high security and dynamically changing situations. However, modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements obtained by existing methods on small-scale datasets, they all require training from scratch and there are no efficient and effective misclassification detection (MisD) methods, hindering practical application towards large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision language model (VLM) leveraging text information to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLM, we construct FSMisD, a Few-Shot prompt learning framework for MisD to refrain from training from scratch and therefore improve tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate the issue of overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvement demonstrates the effectiveness, efficiency and generalizability of our approach.

CVJul 29, 2025
The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM

Shibo Gao, Peipei Yang, Haiyang Guo et al.

Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos, serving as a core technology in the fields of intelligent surveillance and public safety. With the advancement of deep learning, the continuous evolution of deep model architectures has driven innovation in VAD methodologies, significantly enhancing feature representation and scene adaptability, thereby improving algorithm generalization and expanding application boundaries. More importantly, the rapid development of multi-modal large language (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field. Under the support of MLLMs and LLMs, VAD has undergone significant transformations in terms of data annotation, input modalities, model architectures, and task objectives. The surge in publications and the evolution of tasks have created an urgent need for systematic reviews of recent advancements. This paper presents the first comprehensive survey analyzing VAD methods based on MLLMs and LLMs, providing an in-depth discussion of the changes occurring in the VAD field in the era of large models and their underlying causes. Additionally, this paper proposes a unified framework that encompasses both deep neural network (DNN)-based and LLM-based VAD methods, offering a thorough analysis of the new VAD paradigms empowered by LLMs, constructing a classification system, and comparing their strengths and weaknesses. Building on this foundation, this paper focuses on current VAD methods based on MLLMs/LLMs. Finally, based on the trajectory of technological advancements and existing bottlenecks, this paper distills key challenges and outlines future research directions, offering guidance for the VAD community.

CVNov 19, 2025
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance

Songze Li, Mingyu Gao, Tonghua Su et al.

Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.

CLOct 27, 2025
MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

Xinming Wang, Jian Xu, Bin Yu et al.

Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model's thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.

CVApr 14, 2025
Balancing Stability and Plasticity in Pretrained Detector: A Dual-Path Framework for Incremental Object Detection

Songze Li, Qixing Xu, Tonghua Su et al.

The balance between stability and plasticity remains a fundamental challenge in pretrained model-based incremental object detection (PTMIOD). While existing PTMIOD methods demonstrate strong performance on in-domain tasks aligned with pretraining data, their plasticity to cross-domain scenarios remains underexplored. Through systematic component-wise analysis of pretrained detectors, we reveal a fundamental discrepancy: the localization modules demonstrate inherent cross-domain stability-preserving precise bounding box estimation across distribution shifts-while the classification components require enhanced plasticity to mitigate discriminability degradation in cross-domain scenarios. Motivated by these findings, we propose a dual-path framework built upon pretrained DETR-based detectors which decouples localization stability and classification plasticity: the localization path maintains stability to preserve pretrained localization knowledge, while the classification path facilitates plasticity via parameter-efficient fine-tuning and resists forgetting with pseudo-feature replay. Extensive evaluations on both in-domain (MS COCO and PASCAL VOC) and cross-domain (TT100K) benchmarks show state-of-the-art performance, demonstrating our method's ability to effectively balance stability and plasticity in PTMIOD, achieving robust cross-domain adaptation and strong retention of anti-forgetting capabilities.

CVApr 9, 2025
DUKAE: DUal-level Knowledge Accumulation and Ensemble for Pre-Trained Model-Based Continual Learning

Songze Li, Tonghua Su, Xu-Yao Zhang et al.

Pre-trained model-based continual learning (PTMCL) has garnered growing attention, as it enables more rapid acquisition of new knowledge by leveraging the extensive foundational understanding inherent in pre-trained model (PTM). Most existing PTMCL methods use Parameter-Efficient Fine-Tuning (PEFT) to learn new knowledge while consolidating existing memory. However, they often face some challenges. A major challenge lies in the misalignment of classification heads, as the classification head of each task is trained within a distinct feature space, leading to inconsistent decision boundaries across tasks and, consequently, increased forgetting. Another critical limitation stems from the restricted feature-level knowledge accumulation, with feature learning typically restricted to the initial task only, which constrains the model's representation capabilities. To address these issues, we propose a method named DUal-level Knowledge Accumulation and Ensemble (DUKAE) that leverages both feature-level and decision-level knowledge accumulation by aligning classification heads into a unified feature space through Gaussian distribution sampling and introducing an adaptive expertise ensemble to fuse knowledge across feature subspaces. Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and Cars-196 datasets demonstrate the superior performance of our approach.

CVJan 14, 2022
Emergence of Machine Language: Towards Symbolic Intelligence with Neural Networks

Yuqi Wang, Xu-Yao Zhang, Cheng-Lin Liu et al.

Representation is a core issue in artificial intelligence. Humans use discrete language to communicate and learn from each other, while machines use continuous features (like vector, matrix, or tensor in deep neural networks) to represent cognitive patterns. Discrete symbols are low-dimensional, decoupled, and have strong reasoning ability, while continuous features are high-dimensional, coupled, and have incredible abstracting capabilities. In recent years, deep learning has developed the idea of continuous representation to the extreme, using millions of parameters to achieve high accuracies. Although this is reasonable from the statistical perspective, it has other major problems like lacking interpretability, poor generalization, and is easy to be attacked. Since both paradigms have strengths and weaknesses, a better choice is to seek reconciliation. In this paper, we make an initial attempt towards this direction. Specifically, we propose to combine symbolism and connectionism principles by using neural networks to derive a discrete representation. This process is highly similar to human language, which is a natural combination of discrete symbols and neural systems, where the brain processes continuous signals and represents intelligence via discrete language. To mimic this functionality, we denote our approach as machine language. By designing an interactive environment and task, we demonstrated that machines could generate a spontaneous, flexible, and semantic language through cooperation. Moreover, through experiments we show that discrete language representation has several advantages compared with continuous feature representation, from the aspects of interpretability, generalization, and robustness.

CVApr 14, 2021
Dewarping Document Image By Displacement Flow Estimation with Fully Convolutional Network

Guo-Wang Xie, Fei Yin, Xu-Yao Zhang et al.

As camera-based documents are increasingly used, the rectification of distorted document images becomes a need to improve the recognition performance. In this paper, we propose a novel framework for both rectifying distorted document image and removing background finely, by estimating pixel-wise displacements using a fully convolutional network (FCN). The document image is rectified by transformation according to the displacements of pixels. The FCN is trained by regressing displacements of synthesized distorted documents, and to control the smoothness of displacements, we propose a Local Smooth Constraint (LSC) in regularization. Our approach is easy to implement and consumes moderate computing resource. Experiments proved that our approach can dewarp document images effectively under various geometric distortions, and has achieved the state-of-the-art performance in terms of local details and overall effect.

CVDec 1, 2020
Weakly-Supervised Arbitrary-Shaped Text Detection with Expectation-Maximization Algorithm

Mengbiao Zhao, Wei Feng, Fei Yin et al.

Arbitrary-shaped text detection is an important and challenging task in computer vision. Most existing methods require heavy data labeling efforts to produce polygon-level text region labels for supervised training. In order to reduce the cost in data labeling, we study weakly-supervised arbitrary-shaped text detection for combining various weak supervision forms (e.g., image-level tags, coarse, loose and tight bounding boxes), which are far easier for annotation. We propose an Expectation-Maximization (EM) based weakly-supervised learning framework to train an accurate arbitrary-shaped text detector using only a small amount of polygon-level annotated data combined with a large amount of weakly annotated data. Meanwhile, we propose a contour-based arbitrary-shaped text detector, which is suitable for incorporating weakly-supervised learning. Extensive experiments on three arbitrary-shaped text benchmarks (CTW1500, Total-Text and ICDAR-ArT) show that (1) using only 10% strongly annotated data and 90% weakly annotated data, our method yields comparable performance to state-of-the-art methods, (2) with 100% strongly annotated data, our method outperforms existing methods on all three benchmarks. We will make the weakly annotated datasets publicly available in the future.

CVJun 12, 2020
Towards Robust Pattern Recognition: A Review

Xu-Yao Zhang, Cheng-Lin Liu, Ching Y. Suen

The accuracies for many pattern recognition tasks have increased rapidly year by year, achieving or even outperforming human performance. From the perspective of accuracy, pattern recognition seems to be a nearly-solved problem. However, once launched in real applications, the high-accuracy pattern recognition systems may become unstable and unreliable, due to the lack of robustness in open and changing environments. In this paper, we present a comprehensive review of research towards robust pattern recognition from the perspective of breaking three basic and implicit assumptions: closed-world assumption, independent and identically distributed assumption, and clean and big data assumption, which form the foundation of most pattern recognition models. Actually, our brain is robust at learning concepts continually and incrementally, in complex, open and changing environments, with different contexts, modalities and tasks, by showing only a few examples, under weak or noisy supervision. These are the major differences between human intelligence and machine intelligence, which are closely related to the above three assumptions. After witnessing the significant progress in accuracy improvement nowadays, this review paper will enable us to analyze the shortcomings and limitations of current methods and identify future research directions for robust pattern recognition.

CVMay 12, 2019
Predictive Ensemble Learning with Application to Scene Text Detection

Danlu Chen, Xu-Yao Zhang, Wei Zhang et al.

Deep learning based approaches have achieved significant progresses in different tasks like classification, detection, segmentation, and so on. Ensemble learning is widely known to further improve performance by combining multiple complementary models. It is easy to apply ensemble learning for classification tasks, for example, based on averaging, voting, or other methods. However, for other tasks (like object detection) where the outputs are varying in quantity and unable to be simply compared, the ensemble of multiple models become difficult. In this paper, we propose a new method called Predictive Ensemble Learning (PEL), based on powerful predictive ability of deep neural networks, to directly predict the best performing model among a pool of base models for each test example, thus transforming ensemble learning to a traditional classification task. Taking scene text detection as the application, where no suitable ensemble learning strategy exists, PEL can significantly improve the performance, compared to either individual state-of-the-art models, or the fusion of multiple models by non-maximum suppression. Experimental results show the possibility and potential of PEL in predicting different models' performance based only on a query example, which can be extended for ensemble learning in many other complex tasks.

CVJun 2, 2018
SCAN: Sliding Convolutional Attention Network for Scene Text Recognition

Yi-Chao Wu, Fei Yin, Xu-Yao Zhang et al.

Scene text recognition has drawn great attentions in the community of computer vision and artificial intelligence due to its challenges and wide applications. State-of-the-art recurrent neural networks (RNN) based models map an input sequence to a variable length output sequence, but are usually applied in a black box manner and lack of transparency for further improvement, and the maintaining of the entire past hidden states prevents parallel computation in a sequence. In this paper, we investigate the intrinsic characteristics of text recognition, and inspired by human cognition mechanisms in reading texts, we propose a scene text recognition method with sliding convolutional attention network (SCAN). Similar to the eye movement during reading, the process of SCAN can be viewed as an alternation between saccades and visual fixations. Compared to the previous recurrent models, computations over all elements of SCAN can be fully parallelized during training. Experimental results on several challenging benchmarks, including the IIIT5k, SVT and ICDAR 2003/2013 datasets, demonstrate the superiority of SCAN over state-of-the-art methods in terms of both the model interpretability and performance.

CVMay 9, 2018
Robust Classification with Convolutional Prototype Learning

Hong-Ming Yang, Xu-Yao Zhang, Fei Yin et al.

Convolutional neural networks (CNNs) have been widely used for image classification. Despite its high accuracies, CNN has been shown to be easily fooled by some adversarial examples, indicating that CNN is not robust enough for pattern classification. In this paper, we argue that the lack of robustness for CNN is caused by the softmax layer, which is a totally discriminative model and based on the assumption of closed world (i.e., with a fixed number of categories). To improve the robustness, we propose a novel learning framework called convolutional prototype learning (CPL). The advantage of using prototypes is that it can well handle the open world recognition problem and therefore improve the robustness. Under the framework of CPL, we design multiple classification criteria to train the network. Moreover, a prototype loss (PL) is proposed as a regularization to improve the intra-class compactness of the feature representation, which can be viewed as a generative model based on the Gaussian assumption of different classes. Experiments on several datasets demonstrate that CPL can achieve comparable or even better results than traditional CNN, and from the robustness perspective, CPL shows great advantages for both the rejection and incremental category learning tasks.

LGOct 27, 2017
Stochastic Conjugate Gradient Algorithm with Variance Reduction

Xiao-Bo Jin, Xu-Yao Zhang, Kaizhu Huang et al.

Conjugate gradient (CG) methods are a class of important methods for solving linear equations and nonlinear optimization problems. In this paper, we propose a new stochastic CG algorithm with variance reduction and we prove its linear convergence with the Fletcher and Reeves method for strongly convex and smooth functions. We experimentally demonstrate that the CG with variance reduction algorithm converges faster than its counterparts for four learning models, which may be convex, nonconvex or nonsmooth. In addition, its area under the curve performance on six large-scale data sets is comparable to that of the LIBLINEAR solver for the L2-regularized L2-loss but with a significant improvement in computational efficiency

CVSep 6, 2017
Scene Text Recognition with Sliding Convolutional Character Models

Fei Yin, Yi-Chao Wu, Xu-Yao Zhang et al.

Scene text recognition has attracted great interests from the computer vision and pattern recognition community in recent years. State-of-the-art methods use concolutional neural networks (CNNs), recurrent neural networks with long short-term memory (RNN-LSTM) or the combination of them. In this paper, we investigate the intrinsic characteristics of text recognition, and inspired by human cognition mechanisms in reading texts, we propose a scene text recognition method with character models on convolutional feature map. The method simultaneously detects and recognizes characters by sliding the text line image with character models, which are learned end-to-end on text line images labeled with text transcripts. The character classifier outputs on the sliding windows are normalized and decoded with Connectionist Temporal Classification (CTC) based algorithm. Compared to previous methods, our method has a number of appealing properties: (1) It avoids the difficulty of character segmentation which hinders the performance of segmentation-based recognition methods; (2) The model can be trained simply and efficiently because it avoids gradient vanishing/exploding in training RNN-LSTM based models; (3) It bases on character models trained free of lexicon, and can recognize unknown words. (4) The recognition process is highly parallel and enables fast recognition. Our experiments on several challenging English and Chinese benchmarks, including the IIIT-5K, SVT, ICDAR03/13 and TRW15 datasets, demonstrate that the proposed method yields superior or comparable performance to state-of-the-art methods while the model size is relatively small.