Cheng-Lin Liu

CV
h-index36
71papers
4,604citations
Novelty49%
AI Score63

71 Papers

AIFeb 22, 2023Code
A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram

Ming-Liang Zhang, Fei Yin, Cheng-Lin Liu

Geometry problem solving (GPS) is a high-level mathematical reasoning requiring the capacities of multi-modal fusion and geometric knowledge application. Recently, neural solvers have shown great potential in GPS but still be short in diagram presentation and modal fusion. In this work, we convert diagrams into basic textual clauses to describe diagram features effectively, and propose a new neural solver called PGPSNet to fuse multi-modal information efficiently. Combining structural and semantic pre-training, data augmentation and self-limited decoding, PGPSNet is endowed with rich knowledge of geometry theorems and geometric representation, and therefore promotes geometric understanding and reasoning. In addition, to facilitate the research of GPS, we build a new large-scale and fine-annotated GPS dataset named PGPS9K, labeled with both fine-grained diagram annotation and interpretable solution program. Experiments on PGPS9K and an existing dataset Geometry3K validate the superiority of our method over the state-of-the-art neural solvers. Our code, dataset and appendix material are available at \url{https://github.com/mingliangzhang2018/PGPS}.

LGMar 6, 2023Code
Rethinking Confidence Calibration for Failure Prediction

Fei Zhu, Zhen Cheng, Xu-Yao Zhang et al.

Reliable confidence estimation for the predictions is important in many safety-critical applications. However, modern deep neural networks are often overconfident for their incorrect predictions. Recently, many calibration methods have been proposed to alleviate the overconfidence problem. With calibrated confidence, a primary and practical purpose is to detect misclassification errors by filtering out low-confidence predictions (known as failure prediction). In this paper, we find a general, widely-existed but actually-neglected phenomenon that most confidence calibration methods are useless or harmful for failure prediction. We investigate this problem and reveal that popular confidence calibration methods often lead to worse confidence separation between correct and incorrect samples, making it more difficult to decide whether to trust a prediction or not. Finally, inspired by the natural connection between flat minima and confidence separation, we propose a simple hypothesis: flat minima is beneficial for failure prediction. We verify this hypothesis via extensive experiments and further boost the performance by combining two different flat minima techniques. Our code is available at https://github.com/Impression2805/FMFP

LGMar 30, 2023Code
OpenMix: Exploring Outlier Samples for Misclassification Detection

Fei Zhu, Zhen Cheng, Xu-Yao Zhang et al.

Reliable confidence estimation for deep neural classifiers is a challenging yet fundamental requirement in high-stakes applications. Unfortunately, modern deep neural networks are often overconfident for their erroneous predictions. In this work, we exploit the easily available outlier samples, i.e., unlabeled samples coming from non-target classes, for helping detect misclassification errors. Particularly, we find that the well-known Outlier Exposure, which is powerful in detecting out-of-distribution (OOD) samples from unknown classes, does not provide any gain in identifying misclassification errors. Based on these observations, we propose a novel method called OpenMix, which incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation. OpenMix significantly improves confidence reliability under various scenarios, establishing a strong and unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes. The code is publicly available at https://github.com/Impression2805/OpenMix.

CVMay 19, 2022Code
Plane Geometry Diagram Parsing

Ming-Liang Zhang, Fei Yin, Yi-Han Hao et al.

Geometry diagram parsing plays a key role in geometry problem solving, wherein the primitive extraction and relation parsing remain challenging due to the complex layout and between-primitive relationship. In this paper, we propose a powerful diagram parser based on deep learning and graph reasoning. Specifically, a modified instance segmentation method is proposed to extract geometric primitives, and the graph neural network (GNN) is leveraged to realize relation parsing and primitive classification incorporating geometric features and prior knowledge. All the modules are integrated into an end-to-end model called PGDPNet to perform all the sub-tasks simultaneously. In addition, we build a new large-scale geometry diagram dataset named PGDP5K with primitive level annotations. Experiments on PGDP5K and an existing dataset IMP-Geometry3K show that our model outperforms state-of-the-art methods in four sub-tasks remarkably. Our code, dataset and appendix material are available at https://github.com/mingliangzhang2018/PGDP.

CVMar 20, 2022Code
Document Dewarping with Control Points

Guo-Wang Xie, Fei Yin, Xu-Yao Zhang et al.

Document images are now widely captured by handheld devices such as mobile phones. The OCR performance on these images are largely affected due to geometric distortion of the document paper, diverse camera positions and complex backgrounds. In this paper, we propose a simple yet effective approach to rectify distorted document image by estimating control points and reference points. After that, we use interpolation method between control points and reference points to convert sparse mappings to backward mapping, and remap the original distorted document image to the rectified image. Furthermore, control points are controllable to facilitate interaction or subsequent adjustment. We can flexibly select post-processing methods and the number of vertices according to different application scenarios. Experiments show that our approach can rectify document images with various distortion types, and yield state-of-the-art performance on real-world dataset. This paper also provides a training dataset based on control points for document dewarping. Both the code and the dataset are released at https://github.com/gwxie/Document-Dewarping-with-Control-Points.

CVMar 16, 2023Code
Cross-Modal Causal Intervention for Medical Report Generation

Weixing Chen, Yang Liu, Ce Wang et al.

Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance, which can relieve the heavy burden of radiologists by automatically generating the corresponding radiology reports according to the given radiology image. However, generating accurate lesion descriptions remains challenging due to spurious correlations from visual-linguistic biases and inherent limitations of radiological imaging, such as low resolution and noise interference. To address these issues, we propose a two-stage framework named CrossModal Causal Representation Learning (CMCRL), consisting of the Radiological Cross-modal Alignment and Reconstruction Enhanced (RadCARE) pre-training and the Visual-Linguistic Causal Intervention (VLCI) fine-tuning. In the pre-training stage, RadCARE introduces a degradation-aware masked image restoration strategy tailored for radiological images, which reconstructs high-resolution patches from low-resolution inputs to mitigate noise and detail loss. Combined with a multiway architecture and four adaptive training strategies (e.g., text postfix generation with degraded images and text prefixes), RadCARE establishes robust cross-modal correlations even with incomplete data. In the VLCI phase, we deploy causal front-door intervention through two modules: the Visual Deconfounding Module (VDM) disentangles local-global features without fine-grained annotations, while the Linguistic Deconfounding Module (LDM) eliminates context bias without external terminology databases. Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods, with ablation studies confirming the necessity of both stages. Code and models are available at https://github.com/WissingChen/CMCRL.

LGMar 21, 2023Code
Dynamics-Aware Loss for Learning with Label Noise

Xiu-Chuan Li, Xiaobo Xia, Fei Zhu et al.

Label noise poses a serious threat to deep neural networks (DNNs). Employing robust loss functions which reconcile fitting ability with robustness is a simple but effective strategy to handle this problem. However, the widely-used static trade-off between these two factors contradicts the dynamics of DNNs learning with label noise, leading to inferior performance. Therefore, we propose a dynamics-aware loss (DAL) to solve this problem. Considering that DNNs tend to first learn beneficial patterns, then gradually overfit harmful label noise, DAL strengthens the fitting ability initially, then gradually improves robustness. Moreover, at the later stage, to further reduce the negative impact of label noise and combat underfitting simultaneously, we let DNNs put more emphasis on easy examples than hard ones and introduce a bootstrapping term. Both the detailed theoretical analyses and extensive experimental results demonstrate the superiority of our method. Our source code can be found in https://github.com/XiuchuanLi/DAL.

AIJun 3
BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction

Qi Wang, Peijie Wang, Fei Yin et al.

Geometry problem solving poses distinct challenges in artificial intelligence. Existing approaches typically fall into two paradigms: symbolic methods, which exhibit limited adaptability, and neural methods, which are prone to hallucinations. Recent neuro-symbolic hybrids predominantly rely on a unidirectional pipeline where neural outputs are fed into solvers without feedback, making system brittle to early-stage errors. To break this unidirectional bottleneck, we propose BiNSGPS, a framework that establishes Bidirectional Neuro-Symbolic Interaction (BiNS) between a MLLM Adviser and a Symbolic Solver. MLLM Adviser actively incorporates feedback from the symbolic solver to dynamically rectify inconsistent formal representations or propose auxiliary hypotheses, resolving symbolic conflicts and facilitating complex deductions.

CVJun 5, 2023
ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

Wenwen Yu, Chengquan Zhang, Haoyu Cao et al.

Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extraction from Visually-Rich Document images (SVRD). We set up two tracks for SVRD including Track 1: HUST-CELL and Track 2: Baidu-FEST, where HUST-CELL aims to evaluate the end-to-end performance of Complex Entity Linking and Labeling, and Baidu-FEST focuses on evaluating the performance and generalization of Zero-shot / Few-shot Structured Text extraction from an end-to-end perspective. Compared to the current document benchmarks, our two tracks of competition benchmark enriches the scenarios greatly and contains more than 50 types of visually-rich document images (mainly from the actual enterprise applications). The competition opened on 30th December, 2022 and closed on 24th March, 2023. There are 35 participants and 91 valid submissions received for Track 1, and 15 participants and 26 valid submissions received for Track 2. In this report we will presents the motivation, competition datasets, task definition, evaluation protocol, and submission summaries. According to the performance of the submissions, we believe there is still a large gap on the expected information extraction performance for complex and zero-shot scenarios. It is hoped that this competition will attract many researchers in the field of CV and NLP, and bring some new thoughts to the field of Document AI.

CVMar 10, 2022
Towards Open-Set Text Recognition via Label-to-Prototype Learning

Chang Liu, Chun Yang, Hai-Bo Qin et al.

Scene text recognition is a popular topic and extensively used in the industry. Although many methods have achieved satisfactory performance for the close-set text recognition challenges, these methods lose feasibility in open-set scenarios, where collecting data or retraining models for novel characters could yield a high cost. For example, annotating samples for foreign languages can be expensive, whereas retraining the model each time when a novel character is discovered from historical documents costs both time and resources. In this paper, we introduce and formulate a new open-set text recognition task which demands the capability to spot and recognize novel characters without retraining. A label-to-prototype learning framework is also proposed as a baseline for the proposed task. Specifically, the framework introduces a generalizable label-to-prototype mapping function to build prototypes (class centers) for both seen and unseen classes. An open-set predictor is then utilized to recognize or reject samples according to the prototypes. The implementation of rejection capability over out-of-set characters allows automatic spotting of unknown characters in the incoming data stream. Extensive experiments show that our method achieves promising performance on a variety of zero-shot, close-set, and open-set text recognition datasets

CVMay 13, 2022
Unsupervised Structure-Texture Separation Network for Oracle Character Recognition

Mei Wang, Weihong Deng, Cheng-Lin Liu

Oracle bone script is the earliest-known Chinese writing system of the Shang dynasty and is precious to archeology and philology. However, real-world scanned oracle data are rare and few experts are available for annotation which make the automatic recognition of scanned oracle characters become a challenging task. Therefore, we aim to explore unsupervised domain adaptation to transfer knowledge from handprinted oracle data, which are easy to acquire, to scanned domain. We propose a structure-texture separation network (STSN), which is an end-to-end learning framework for joint disentanglement, transformation, adaptation and recognition. First, STSN disentangles features into structure (glyph) and texture (noise) components by generative models, and then aligns handprinted and scanned data in structure feature space such that the negative influence caused by serious noises can be avoided when adapting. Second, transformation is achieved via swapping the learned textures across domains and a classifier for final classification is trained to predict the labels of the transformed scanned characters. This not only guarantees the absolute separation, but also enhances the discriminative ability of the learned features. Extensive experiments on Oracle-241 dataset show that STSN outperforms other adaptation methods and successfully improves recognition performance on scanned data even when they are contaminated by long burial and careless excavation.

CVAug 29, 2024Code
Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

Zhengqing Gao, Xiang Ao, Xu-Yao Zhang et al.

Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully explore the knowledge of text modality, demonstrating strong zero-shot recognition performance, which is naturally suited for various open-set problems. More recently, some research focuses on fine-tuning such models to downstream tasks. Prompt tuning methods achieved huge improvements by learning context vectors on few-shot data. However, through the evaluation under open-set adaptation setting with the test data including new classes, we find that there exists a dilemma that learned prompts have worse generalization abilities than hand-crafted prompts. In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach, which leverages the maximum concept matching (MCM) scores as dynamic weights to generate an input-conditioned prompt for each image during test. Through extensive experiments on 11 different datasets, we show that our proposed method outperforms all comparison methods on average considering both base and new classes. The code is available at https://github.com/gaozhengqing/TTPT

CVAug 4, 2023
Class Incremental Learning with Self-Supervised Pre-Training and Prototype Learning

Wenzhuo Liu, Xinjian Wu, Fei Zhu et al.

Deep Neural Network (DNN) has achieved great success on datasets of closed class set. However, new classes, like new categories of social media topics, are continuously added to the real world, making it necessary to incrementally learn. This is hard for DNN because it tends to focus on fitting to new classes while ignoring old classes, a phenomenon known as catastrophic forgetting. State-of-the-art methods rely on knowledge distillation and data replay techniques but still have limitations. In this work, we analyze the causes of catastrophic forgetting in class incremental learning, which owes to three factors: representation drift, representation confusion, and classifier distortion. Based on this view, we propose a two-stage learning framework with a fixed encoder and an incrementally updated prototype classifier. The encoder is trained with self-supervised learning to generate a feature space with high intrinsic dimensionality, thus improving its transferability and generality. The classifier incrementally learns new prototypes while retaining the prototypes of previously learned data, which is crucial in preserving the decision boundary.Our method does not rely on preserved samples of old classes, is thus a non-exemplar based CIL method. Experiments on public datasets show that our method can significantly outperform state-of-the-art exemplar-based methods when they reserved 5 examplers per class, under the incremental setting of 10 phases, by 18.24% on CIFAR-100 and 9.37% on ImageNet100.

LGMar 2, 2023
Average of Pruning: Improving Performance and Stability of Out-of-Distribution Detection

Zhen Cheng, Fei Zhu, Xu-Yao Zhang et al.

Detecting Out-of-distribution (OOD) inputs have been a critical issue for neural networks in the open world. However, the unstable behavior of OOD detection along the optimization trajectory during training has not been explored clearly. In this paper, we first find the performance of OOD detection suffers from overfitting and instability during training: 1) the performance could decrease when the training error is near zero, and 2) the performance would vary sharply in the final stage of training. Based on our findings, we propose Average of Pruning (AoP), consisting of model averaging and pruning, to mitigate the unstable behaviors. Specifically, model averaging can help achieve a stable performance by smoothing the landscape, and pruning is certified to eliminate the overfitting by eliminating redundant features. Comprehensive experiments on various datasets and architectures are conducted to verify the effectiveness of our method.

CVNov 25, 2023
LANS: A Layout-Aware Neural Solver for Plane Geometry Problem

Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin et al.

Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion, and reasoning. Existing neural solvers take GPS as a vision-language task but are short in the representation of geometry diagrams that carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, integrated with two new modules: multimodal layout-aware pre-trained language module (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural-semantic pre-training (SSP) to implement global relationship modeling, and point-match pre-training (PMP) to achieve alignment between visual points and textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion for further boosting layout awareness of LANS. Extensive experiments on datasets Geometry3K and PGPS9K validate the effectiveness of the layout-aware modules and superior problem-solving performance of our LANS solver, over existing symbolic and neural solvers. The code will be made public available soon.

CVSep 2, 2024
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Yi Chen, Jian Xu, Xu-Yao Zhang et al.

With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a text information-guided dynamic visual token recovery mechanism that does not require training. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity. Our source code will be made publicly available following acceptance.

CVNov 22, 2023
Unified Classification and Rejection: A One-versus-All Framework

Zhen Cheng, Xu-Yao Zhang, Cheng-Lin Liu

Classifying patterns of known classes and rejecting ambiguous and novel (also called as out-of-distribution (OOD)) inputs are involved in open world pattern recognition. Deep neural network models usually excel in closed-set classification while performs poorly in rejecting OOD inputs. To tackle this problem, numerous methods have been designed to perform open set recognition (OSR) or OOD rejection/detection tasks. Previous methods mostly take post-training score transformation or hybrid models to ensure low scores on OOD inputs while separating known classes. In this paper, we attempt to build a unified framework for building open set classifiers for both classification and OOD rejection. We formulate the open set recognition of $ K $-known-class as a $ (K+1) $-class classification problem with model trained on known-class samples only. By decomposing the $ K $-class problem into $ K $ one-versus-all (OVA) binary classification tasks and binding some parameters, we show that combining the scores of OVA classifiers can give $ (K+1) $-class posterior probabilities, which enables classification and OOD rejection in a unified framework. To maintain the closed-set classification accuracy of the OVA trained classifier, we propose a hybrid training strategy combining OVA loss and multi-class cross-entropy loss. We implement the OVA framework and hybrid training strategy on the recently proposed convolutional prototype network and prototype classifier on vision transformer (ViT) backbone. Experiments on popular OSR and OOD detection datasets demonstrate that the proposed framework, using a single multi-class classifier, yields competitive performance in closed-set classification, OOD detection, and misclassification detection.

AIJul 10, 2024
Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram

Ming-Liang Zhang, Zhong-Zhi Li, Fei Yin et al.

Geometry problem solving (GPS) requires capacities of multi-modal understanding, multi-hop reasoning and theorem knowledge application. In this paper, we propose a neural-symbolic model for plane geometry problem solving (PGPS), named PGPSNet-v2, with three key steps: modal fusion, reasoning process and knowledge verification. In modal fusion, we leverage textual clauses to express fine-grained structural and semantic content of geometry diagram, and fuse diagram with textual problem efficiently through structural-semantic pre-training. For reasoning, we design an explicable solution program to describe the geometric reasoning process, and employ a self-limited decoder to generate solution program autoregressively. To reduce solution errors, a multi-level theorem verifier is proposed to eliminate solutions that do not match geometric principles, alleviating the hallucination of the neural model. We also construct a large-scale geometry problem dataset called PGPS9K, containing fine-grained annotations of textual clauses, solution program and involved knowledge tuples. Extensive experiments on datasets Geometry3K and PGPS9K show that our PGPSNet solver outperforms existing symbolic and neural solvers in GPS performance, while maintaining good explainability and reliability, and the solver components (fusion, reasoning, verification) are all justified effective.

CVJul 19, 2024
PASS++: A Dual Bias Reduction Framework for Non-Exemplar Class-Incremental Learning

Fei Zhu, Xu-Yao Zhang, Zhen Cheng et al.

Class-incremental learning (CIL) aims to recognize new classes incrementally while maintaining the discriminability of old classes. Most existing CIL methods are exemplar-based, i.e., storing a part of old data for retraining. Without relearning old data, those methods suffer from catastrophic forgetting. In this paper, we figure out two inherent problems in CIL, i.e., representation bias and classifier bias, that cause catastrophic forgetting of old knowledge. To address these two biases, we present a simple and novel dual bias reduction framework that employs self-supervised transformation (SST) in input space and prototype augmentation (protoAug) in deep feature space. On the one hand, SST alleviates the representation bias by learning generic and diverse representations that can transfer across different tasks. On the other hand, protoAug overcomes the classifier bias by explicitly or implicitly augmenting prototypes of old classes in the deep feature space, which poses tighter constraints to maintain previously learned decision boundaries. We further propose hardness-aware prototype augmentation and multi-view ensemble strategies, leading to significant improvements. The proposed framework can be easily integrated with pre-trained models. Without storing any samples of old classes, our method can perform comparably with state-of-the-art exemplar-based approaches which store plenty of old data. We hope to draw the attention of researchers back to non-exemplar CIL by rethinking the necessity of storing old samples in CIL.

NEAug 1, 2022
Biologically Plausible Training of Deep Neural Networks Using a Top-down Credit Assignment Network

Jian-Hui Chen, Cheng-Lin Liu, Zuoren Wang

Despite the widespread adoption of Backpropagation algorithm-based Deep Neural Networks, the biological infeasibility of the BP algorithm could potentially limit the evolution of new DNN models. To find a biologically plausible algorithm to replace BP, we focus on the top-down mechanism inherent in the biological brain. Although top-down connections in the biological brain play crucial roles in high-level cognitive functions, their application to neural network learning remains unclear. This study proposes a two-level training framework designed to train a bottom-up network using a Top-Down Credit Assignment Network (TDCA-network). The TDCA-network serves as a substitute for the conventional loss function and the back-propagation algorithm, widely used in neural network training. We further introduce a brain-inspired credit diffusion mechanism, significantly reducing the TDCA-network's parameter complexity, thereby greatly accelerating training without compromising the network's performance.Our experiments involving non-convex function optimization, supervised learning, and reinforcement learning reveal that a well-trained TDCA-network outperforms back-propagation across various settings. The visualization of the update trajectories in the loss landscape indicates the TDCA-network's ability to bypass local minima where BP-based trajectories typically become trapped. The TDCA-network also excels in multi-task optimization, demonstrating robust generalizability across different datasets in supervised learning and unseen task settings in reinforcement learning. Moreover, the results indicate that the TDCA-network holds promising potential to train neural networks across diverse architectures.

AIFeb 24, 2025Code
From System 1 to System 2: A Survey of Reasoning Large Language Models

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang et al.

Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time \href{https://github.com/zzli2022/Awesome-Slow-Reason-System}{GitHub Repository} to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.

AIDec 18, 2025
PDE-Agent: A toolchain-augmented multi-agent framework for PDE solving

Jianming Liu, Ren Zhu, Jian Xu et al.

Solving Partial Differential Equations (PDEs) is a cornerstone of engineering and scientific research. Traditional methods for PDE solving are cumbersome, relying on manual setup and domain expertise. While Physics-Informed Neural Network (PINNs) introduced end-to-end neural network-based solutions, and frameworks like DeepXDE further enhanced automation, these approaches still depend on expert knowledge and lack full autonomy. In this work, we frame PDE solving as tool invocation via LLM-driven agents and introduce PDE-Agent, the first toolchain-augmented multi-agent collaboration framework, inheriting the reasoning capacity of LLMs and the controllability of external tools and enabling automated PDE solving from natural language descriptions. PDE-Agent leverages the strengths of multi-agent and multi-tool collaboration through two key innovations: (1) A Prog-Act framework with graph memory for multi-agent collaboration, which enables effective dynamic planning and error correction via dual-loop mechanisms (localized fixes and global revisions). (2) A Resource-Pool integrated with a tool-parameter separation mechanism for multi-tool collaboration. This centralizes the management of runtime artifacts and resolves inter-tool dependency gaps in existing frameworks. To validate and evaluate this new paradigm for PDE solving , we develop PDE-Bench, a multi-type PDE Benchmark for agent-based tool collaborative solving, and propose multi-level metrics for assessing tool coordination. Evaluations verify that PDE-Agent exhibits superior applicability and performance in complex multi-step, cross-step dependent tasks. This new paradigm of toolchain-augmented multi-agent PDE solving will further advance future developments in automated scientific computing. Our source code and dataset will be made publicly available.

CVAug 17, 2024
StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Jiao Zhang, Jian Xu, Xu-Yao Zhang et al.

In real-world applications, the sample distribution at the inference stage often differs from the one at the training stage, causing performance degradation of trained deep models. The research on domain generalization (DG) aims to develop robust algorithms that can improve the generalized performance in unseen domains by training on a few domains. However, the domain-agnostic vision model, trained on a limited number of domains using traditional domain generalization methods, cannot guarantee its effectiveness in dealing with unseen domains. The introduction of language can break the closed cognition space of the vision model, providing additional semantic information that cannot be inferred from vision-only datasets. In this paper, we propose to overcome the challenge in previous DG methods by introducing the style prompt in the language modality to adapt the trained model dynamically. In particular, we train a style prompter to extract style information of the current image into an embedding in the token embedding space and place it in front of the candidate category words as prior knowledge to prompt the model. Our open space partition of the style token embedding space and the hand-crafted style regularization enable the trained style prompter to handle data from unknown domains effectively. Extensive experiments verify the effectiveness of our method and demonstrate state-of-the-art performances on multiple public datasets. Codes will be available after the acceptance of this paper.

CVSep 12, 2023
Towards Reliable Domain Generalization: A New Dataset and Evaluations

Jiao Zhang, Xu-Yao Zhang, Cheng-Lin Liu

There are ubiquitous distribution shifts in the real world. However, deep neural networks (DNNs) are easily biased towards the training set, which causes severe performance degradation when they receive out-of-distribution data. Many methods are studied to train models that generalize under various distribution shifts in the literature of domain generalization (DG). However, the recent DomainBed and WILDS benchmarks challenged the effectiveness of these methods. Aiming at the problems in the existing research, we propose a new domain generalization task for handwritten Chinese character recognition (HCCR) to enrich the application scenarios of DG method research. We evaluate eighteen DG methods on the proposed PaHCC (Printed and Handwritten Chinese Characters) dataset and show that the performance of existing methods on this dataset is still unsatisfactory. Besides, under a designed dynamic DG setting, we reveal more properties of DG methods and argue that only the leave-one-domain-out protocol is unreliable. We advocate that researchers in the DG community refer to dynamic performance of methods for more comprehensive and reliable evaluation. Our dataset and evaluations bring new perspectives to the community for more substantial progress. We will make our dataset public with the article published to facilitate the study of domain generalization.

CVMar 7, 2024Code
Active Generalized Category Discovery

Shijie Ma, Fei Zhu, Zhun Zhong et al.

Generalized Category Discovery (GCD) is a pragmatic and challenging open-world task, which endeavors to cluster unlabeled samples from both novel and old classes, leveraging some labeled data of old classes. Given that knowledge learned from old classes is not fully transferable to new classes, and that novel categories are fully unlabeled, GCD inherently faces intractable problems, including imbalanced classification performance and inconsistent confidence between old and new classes, especially in the low-labeling regime. Hence, some annotations of new classes are deemed necessary. However, labeling new classes is extremely costly. To address this issue, we take the spirit of active learning and propose a new setting called Active Generalized Category Discovery (AGCD). The goal is to improve the performance of GCD by actively selecting a limited amount of valuable samples for labeling from the oracle. To solve this problem, we devise an adaptive sampling strategy, which jointly considers novelty, informativeness and diversity to adaptively select novel samples with proper uncertainty. However, owing to the varied orderings of label indices caused by the clustering of novel classes, the queried labels are not directly applicable to subsequent training. To overcome this issue, we further propose a stable label mapping algorithm that transforms ground truth labels to the label space of the classifier, thereby ensuring consistent training across different active selection stages. Our method achieves state-of-the-art performance on both generic and fine-grained datasets. Our code is available at https://github.com/mashijie1028/ActiveGCD

CVJul 14, 2024
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models

Xinjian Wu, Ruisong Zhang, Jie Qin et al.

Segmenting and recognizing diverse object parts is crucial in computer vision and robotics. Despite significant progress in object segmentation, part-level segmentation remains underexplored due to complex boundaries and scarce annotated data. To address this, we propose a novel Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM, built on the large-scale pre-trained vision foundation model, Segment Anything Model (SAM). WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions. During its training phase, it only uses weakly supervised labels in the form of bounding boxes or points. Extensive experiments demonstrate that, through exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations. Specifically, WPS-SAM achieves 68.93% mIOU and 79.53% mACC on the PartImageNet dataset, surpassing state-of-the-art fully supervised methods by approximately 4% in terms of mIOU.

AIDec 24, 2024Code
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

Chao Deng, Jiale Yuan, Pi Bu et al.

Large vision language models (LVLMs) have improved the document understanding capabilities remarkably, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout elements locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidences. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

CVJan 4, 2024Code
PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning

Haiyang Guo, Fei Zhu, Wenzhuo Liu et al.

Existing federated learning methods have effectively dealt with decentralized learning in scenarios involving data privacy and non-IID data. However, in real-world situations, each client dynamically learns new classes, requiring the global model to classify all seen classes. To effectively mitigate catastrophic forgetting and data heterogeneity under low communication costs, we propose a simple and effective method named PILoRA. On the one hand, we adopt prototype learning to learn better feature representations and leverage the heuristic information between prototypes and class features to design a prototype re-weight module to solve the classifier bias caused by data heterogeneity without retraining the classifier. On the other hand, we view incremental learning as the process of learning distinct task vectors and encoding them within different LoRA parameters. Accordingly, we propose Incremental LoRA to mitigate catastrophic forgetting. Experimental results on standard datasets indicate that our method outperforms the state-of-the-art approaches significantly. More importantly, our method exhibits strong robustness and superiority in different settings and degrees of data heterogeneity. The code is available at \url{https://github.com/Ghy0501/PILoRA}.

CVMar 5, 2024Code
Revisiting Confidence Estimation: Towards Reliable Failure Prediction

Fei Zhu, Xu-Yao Zhang, Zhen Cheng et al.

Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction. The code is available at \url{https://github.com/Impression2805/FMFP}.

CVApr 9, 2024Code
Unified Entropy Optimization for Open-Set Test-Time Adaptation

Zhengqing Gao, Xu-Yao Zhang, Cheng-Lin Liu

Test-time adaptation (TTA) aims at adapting a model pre-trained on the labeled source domain to the unlabeled target domain. Existing methods usually focus on improving TTA performance under covariate shifts, while neglecting semantic shifts. In this paper, we delve into a realistic open-set TTA setting where the target domain may contain samples from unknown classes. Many state-of-the-art closed-set TTA methods perform poorly when applied to open-set scenarios, which can be attributed to the inaccurate estimation of data distribution and model confidence. To address these issues, we propose a simple but effective framework called unified entropy optimization (UniEnt), which is capable of simultaneously adapting to covariate-shifted in-distribution (csID) data and detecting covariate-shifted out-of-distribution (csOOD) data. Specifically, UniEnt first mines pseudo-csID and pseudo-csOOD samples from test data, followed by entropy minimization on the pseudo-csID data and entropy maximization on the pseudo-csOOD data. Furthermore, we introduce UniEnt+ to alleviate the noise caused by hard data partition leveraging sample-level confidence. Extensive experiments on CIFAR benchmarks and Tiny-ImageNet-C show the superiority of our framework. The code is available at https://github.com/gaozhengqing/UniEnt

LGDec 23, 2025
QE-Catalytic: A Graph-Language Multimodal Base Model for Relaxed-Energy Prediction in Catalytic Adsorption

Yanjie Li, Jian Xu, Xueqing Chen et al.

Adsorption energy is a key descriptor of catalytic reactivity. It is fundamentally defined as the difference between the relaxed total energy of the adsorbate-surface system and that of an appropriate reference state; therefore, the accuracy of relaxed-energy prediction directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) can natively operate on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. In contrast, language-model-based approaches, while enabling human-readable textual descriptions and reducing reliance on explicit graph -- thereby broadening applicability -- remain insufficient in both adsorption-configuration energy prediction accuracy and in distinguishing ``the same system with different configurations,'' even with graph-assisted pretraining in the style of GAP-CATBERTa. To this end, we propose QE-Catalytic, a multimodal framework that deeply couples a large language model (\textbf{Q}wen) with an E(3)-equivariant graph Transformer (\textbf{E}quiformer-V2), enabling unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly leverages three-dimensional structures and structured configuration text, and injects ``3D geometric information'' into the language channel via graph-text alignment, allowing it to function as a high-performance text-based predictor when precise coordinates are unavailable, while also autoregressively generating CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713~eV to 0.486~eV, and consistently outperforms baseline models such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.

LGApr 2, 2025Code
ProtoGCD: Unified and Unbiased Prototype Learning for Generalized Category Discovery

Shijie Ma, Fei Zhu, Xu-Yao Zhang et al.

Generalized category discovery (GCD) is a pragmatic but underexplored problem, which requires models to automatically cluster and discover novel categories by leveraging the labeled samples from old classes. The challenge is that unlabeled data contain both old and new classes. Early works leveraging pseudo-labeling with parametric classifiers handle old and new classes separately, which brings about imbalanced accuracy between them. Recent methods employing contrastive learning neglect potential positives and are decoupled from the clustering objective, leading to biased representations and sub-optimal results. To address these issues, we introduce a unified and unbiased prototype learning framework, namely ProtoGCD, wherein old and new classes are modeled with joint prototypes and unified learning objectives, {enabling unified modeling between old and new classes}. Specifically, we propose a dual-level adaptive pseudo-labeling mechanism to mitigate confirmation bias, together with two regularization terms to collectively help learn more suitable representations for GCD. Moreover, for practical considerations, we devise a criterion to estimate the number of new classes. Furthermore, we extend ProtoGCD to detect unseen outliers, achieving task-level unification. Comprehensive experiments show that ProtoGCD achieves state-of-the-art performance on both generic and fine-grained datasets. The code is available at https://github.com/mashijie1028/ProtoGCD.

CVApr 16
Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Peijie Wang, Ming-Liang Zhang, Jun Cao et al.

Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry which requires spatial understanding remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs' capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.

CLMar 17, 2025Code
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

Haiyang Guo, Fanhu Zeng, Ziwei Xiang et al.

Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.

LGMar 17, 2025Code
Federated Continual Instruction Tuning

Haiyang Guo, Fanhu Zeng, Fei Zhu et al.

A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Code and dataset are released at https://github.com/Ghy0501/FCIT.

CLJun 3, 2025Code
TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Zhong-Zhi Li, Xiao Liang, Zihao Tang et al.

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially during inference with extremely long outputs--has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.

LGJun 16, 2025Code
Continual Learning for Generative AI: From LLMs to MLLMs and Beyond

Haiyang Guo, Fanhu Zeng, Fei Zhu et al.

The rapid advancement of generative models has empowered modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models are fundamentally constrained by \emph{catastrophic forgetting}, \ie~a persistent challenge where models experience performance degradation on previously learned tasks when adapting to new tasks. To address this practical limitation, numerous approaches have been proposed to enhance the adaptability and scalability of generative AI in real-world applications. In this work, we present a comprehensive survey of continual learning methods for mainstream generative AI models, encompassing large language models, multimodal large language models, vision-language-action models, and diffusion models. Drawing inspiration from the memory mechanisms of the human brain, we systematically categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based methods, while elucidating their underlying methodologies and motivations. We further analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones, thereby providing deeper insights into the field. The project page of this paper is available at https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models.

CVApr 5, 2025Code
DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning

Xiao-Hui Li, Fei Yin, Cheng-Lin Liu

Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at https://github.com/xhli-git/DocSAM.

CLJun 28, 2024Code
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin et al.

Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Despite the datasets like MathVista proposed benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 education in Chinese language. To systematically evaluate the capability of multimodal large models in solving Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, contraining 23k multimodal K12 math related questions, forming the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH questions from elementary to high school levels, provide increased diversity in problem types, solution objectives, visual elements, detailed knowledge points, and standard solution annotations. We have constructed an open-source tool GradeGPT integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation. Our data and code are available.

CVMay 13, 2023Code
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Yuliang Liu, Zhang Li, Mingxin Huang et al.

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.

LGMar 4, 2024
Open-world machine learning: A review and new outlooks

Fei Zhu, Shijie Ma, Zhen Cheng et al.

Machine learning has achieved remarkable success in many applications. However, existing studies are largely based on the closed-world assumption, which assumes that the environment is stationary, and the model is fixed once deployed. In many real-world applications, this fundamental and rather naive assumption may not hold because an open environment is complex, dynamic, and full of unknowns. In such cases, rejecting unknowns, discovering novelties, and then continually learning them, could enable models to be safe and evolve continually as biological systems do. This article presents a holistic view of open-world machine learning by investigating unknown rejection, novelty discovery, and continual learning in a unified paradigm. The challenges, principles, and limitations of current methodologies are discussed in detail. Furthermore, widely used benchmarks, metrics, and performances are summarized. Finally, we discuss several potential directions for further progress in the field. By providing a comprehensive introduction to the emerging open-world machine learning paradigm, this article aims to help researchers build more powerful AI systems in their respective fields, and to promote the development of artificial general intelligence.

AIFeb 28, 2025
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

Peijie Wang, Zhong-Zhi Li, Fei Yin et al.

Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs' mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.

CVMar 11, 2024
Ensemble Quadratic Assignment Network for Graph Matching

Haoru Tan, Chuang Wang, Sitong Wu et al.

Graph matching is a commonly used technique in computer vision and pattern recognition. Recent data-driven approaches have improved the graph matching accuracy remarkably, whereas some traditional algorithm-based methods are more robust to feature noises, outlier nodes, and global transformation (e.g.~rotation). In this paper, we propose a graph neural network (GNN) based approach to combine the advantages of data-driven and traditional methods. In the GNN framework, we transform traditional graph-matching solvers as single-channel GNNs on the association graph and extend the single-channel architecture to the multi-channel network. The proposed model can be seen as an ensemble method that fuses multiple algorithms at every iteration. Instead of averaging the estimates at the end of the ensemble, in our approach, the independent iterations of the ensembled algorithms exchange their information after each iteration via a 1x1 channel-wise convolution layer. Experiments show that our model improves the performance of traditional algorithms significantly. In addition, we propose a random sampling strategy to reduce the computational complexity and GPU memory usage, so the model applies to matching graphs with thousands of nodes. We evaluate the performance of our method on three tasks: geometric graph matching, semantic feature matching, and few-shot 3D shape classification. The proposed model performs comparably or outperforms the best existing GNN-based methods.

CVJun 10, 2025
LLaVA-c: Continual Improved Visual Instruction Tuning

Wenzhuo Liu, Fei Zhu, Haiyang Guo et al.

Multimodal models like LLaVA-1.5 achieve state-of-the-art visual understanding through visual instruction tuning on multitask datasets, enabling strong instruction-following and multimodal performance. However, multitask learning faces challenges such as task balancing, requiring careful adjustment of data proportions, and expansion costs, where new tasks risk catastrophic forgetting and need costly retraining. Continual learning provides a promising alternative to acquiring new knowledge incrementally while preserving existing capabilities. However, current methods prioritize task-specific performance, neglecting base model degradation from overfitting to specific instructions, which undermines general capabilities. In this work, we propose a simple but effective method with two modifications on LLaVA-1.5: spectral-aware consolidation for improved task balance and unsupervised inquiry regularization to prevent base model degradation. We evaluate both general and task-specific performance across continual pretraining and fine-tuning. Experiments demonstrate that LLaVA-c consistently enhances standard benchmark performance and preserves general capabilities. For the first time, we show that task-by-task continual learning can achieve results that match or surpass multitask joint learning. The code will be publicly released.

CVNov 18, 2024
A comprehensive survey of oracle character recognition: challenges, benchmarks, and beyond

Jing Li, Xueke Chi, Qiufeng Wang et al.

Oracle character recognition-an analysis of ancient Chinese inscriptions found on oracle bones-has become a pivotal field intersecting archaeology, paleography, and historical cultural studies. Traditional methods of oracle character recognition have relied heavily on manual interpretation by experts, which is not only labor-intensive but also limits broader accessibility to the general public. With recent breakthroughs in pattern recognition and deep learning, there is a growing movement towards the automation of oracle character recognition (OrCR), showing considerable promise in tackling the challenges inherent to these ancient scripts. However, a comprehensive understanding of OrCR still remains elusive. Therefore, this paper presents a systematic and structured survey of the current landscape of OrCR research. We commence by identifying and analyzing the key challenges of OrCR. Then, we provide an overview of the primary benchmark datasets and digital resources available for OrCR. A review of contemporary research methodologies follows, in which their respective efficacies, limitations, and applicability to the complex nature of oracle characters are critically highlighted and examined. Additionally, our review extends to ancillary tasks associated with OrCR across diverse disciplines, providing a broad-spectrum analysis of its applications. We conclude with a forward-looking perspective, proposing potential avenues for future investigations that could yield significant advancements in the field.

CVMar 27, 2024
Towards Non-Exemplar Semi-Supervised Class-Incremental Learning

Wenzhuo Liu, Fei Zhu, Cheng-Lin Liu

Deep neural networks perform remarkably well in close-world scenarios. However, novel classes emerged continually in real applications, making it necessary to learn incrementally. Class-incremental learning (CIL) aims to gradually recognize new classes while maintaining the discriminability of old ones. Existing CIL methods have two limitations: a heavy reliance on preserving old data for forgetting mitigation and the need for vast labeled data for knowledge adaptation. To overcome these issues, we propose a non-exemplar semi-supervised CIL framework with contrastive learning and semi-supervised incremental prototype classifier (Semi-IPC). On the one hand, contrastive learning helps the model learn rich representations, easing the trade-off between learning representations of new classes and forgetting that of old classes. On the other hand, Semi-IPC learns a prototype for each class with unsupervised regularization, enabling the model to incrementally learn from partially labeled new data while maintaining the knowledge of old classes. Experiments on benchmark datasets demonstrate the strong performance of our method: without storing any old samples and only using less than 1% of labels, Semi-IPC outperforms advanced exemplar-based methods. We hope our work offers new insights for future CIL research. The code will be made publicly available.

LGMar 27, 2024
Branch-Tuning: Balancing Stability and Plasticity for Continual Self-Supervised Learning

Wenzhuo Liu, Fei Zhu, Cheng-Lin Liu

Self-supervised learning (SSL) has emerged as an effective paradigm for deriving general representations from vast amounts of unlabeled data. However, as real-world applications continually integrate new content, the high computational and resource demands of SSL necessitate continual learning rather than complete retraining. This poses a challenge in striking a balance between stability and plasticity when adapting to new information. In this paper, we employ Centered Kernel Alignment for quantitatively analyzing model stability and plasticity, revealing the critical roles of batch normalization layers for stability and convolutional layers for plasticity. Motivated by this, we propose Branch-tuning, an efficient and straightforward method that achieves a balance between stability and plasticity in continual SSL. Branch-tuning consists of branch expansion and compression, and can be easily applied to various SSL methods without the need of modifying the original methods, retaining old data or models. We validate our method through incremental experiments on various benchmark datasets, demonstrating its effectiveness and practical value in real-world scenarios. We hope our work offers new insights for future continual self-supervised learning research. The code will be made publicly available.

AIApr 7, 2025
Multimodal Agricultural Agent Architecture (MA3): A New Paradigm for Intelligent Agricultural Decision-Making

Zhuoning Xu, Jian Xu, Mingqing Zhang et al.

As a strategic pillar industry for human survival and development, modern agriculture faces dual challenges: optimizing production efficiency and achieving sustainable development. Against the backdrop of intensified climate change leading to frequent extreme weather events, the uncertainty risks in agricultural production systems are increasing exponentially. To address these challenges, this study proposes an innovative \textbf{M}ultimodal \textbf{A}gricultural \textbf{A}gent \textbf{A}rchitecture (\textbf{MA3}), which leverages cross-modal information fusion and task collaboration mechanisms to achieve intelligent agricultural decision-making. This study constructs a multimodal agricultural agent dataset encompassing five major tasks: classification, detection, Visual Question Answering (VQA), tool selection, and agent evaluation. We propose a unified backbone for sugarcane disease classification and detection tools, as well as a sugarcane disease expert model. By integrating an innovative tool selection module, we develop a multimodal agricultural agent capable of effectively performing tasks in classification, detection, and VQA. Furthermore, we introduce a multi-dimensional quantitative evaluation framework and conduct a comprehensive assessment of the entire architecture over our evaluation dataset, thereby verifying the practicality and robustness of MA3 in agricultural scenarios. This study provides new insights and methodologies for the development of agricultural agents, holding significant theoretical and practical implications. Our source code and dataset will be made publicly available upon acceptance.

CVMar 19, 2025
A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal et al.

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection that follows a block-level comparison between feature map and source image and harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context.Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.

CLOct 27, 2025
MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

Xinming Wang, Jian Xu, Bin Yu et al.

Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model's thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.