CLFeb 2Code
Kimi K2.5: Visual Agentic IntelligenceKimi Team, Tongtong Bai, Yifan Bai et al.
We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
CVDec 30, 2025
GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical EvaluationYuan Feng, Yue Yang, Xiaohan He et al.
Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
LGJan 8
Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable RewardJianlong Chen, Daocheng Fu, Shengze Xu et al.
Multimodal Large Language Models (MLLMs) struggle with complex geometric reasoning, largely because "black box" outcome-based supervision fails to distinguish between lucky guesses and rigorous deduction. To address this, we introduce a paradigm shift towards subgoal-level evaluation and learning. We first construct GeoGoal, a benchmark synthesized via a rigorous formal verification data engine, which converts abstract proofs into verifiable numeric subgoals. This structure reveals a critical divergence between reasoning quality and outcome accuracy. Leveraging this, we propose the Sub-Goal Verifiable Reward (SGVR) framework, which replaces sparse signals with dense rewards based on the Skeleton Rate. Experiments demonstrate that SGVR not only enhances geometric performance (+9.7%) but also exhibits strong generalization, transferring gains to general math (+8.0%) and other general reasoning tasks (+2.8%), demonstrating broad applicability across diverse domains.
CLJul 7, 2024
Advancing Prompt Recovery in NLP: A Deep Dive into the Integration of Gemma-2b-it and Phi2 ModelsJianlong Chen, Wei Xu, Zhicheng Ding et al.
Prompt recovery, a crucial task in natural language processing, entails the reconstruction of prompts or instructions that language models use to convert input text into a specific output. Although pivotal, the design and effectiveness of prompts represent a challenging and relatively untapped field within NLP research. This paper delves into an exhaustive investigation of prompt recovery methodologies, employing a spectrum of pre-trained language models and strategies. Our study is a comparative analysis aimed at gauging the efficacy of various models on a benchmark dataset, with the goal of pinpointing the most proficient approach for prompt recovery. Through meticulous experimentation and detailed analysis, we elucidate the outstanding performance of the Gemma-2b-it + Phi2 model + Pretrain. This model surpasses its counterparts, showcasing its exceptional capability in accurately reconstructing prompts for text transformation tasks. Our findings offer a significant contribution to the existing knowledge on prompt recovery, shedding light on the intricacies of prompt design and offering insightful perspectives for future innovations in text rewriting and the broader field of natural language processing.
LGApr 27, 2025Code
Hierarchical Attention Generates Better ProofsJianlong Chen, Chao Li, Yang Yuan et al.
Large language models (LLMs) have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce \textbf{Hierarchical Attention}, a regularization method that aligns LLMs' attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05\% on miniF2F and 1.69\% on ProofNet while reducing proof complexity by 23.81\% and 16.50\% respectively. The code is available at https://github.com/Car-pe/HAGBP.
LGJun 24, 2024Code
MD tree: a model-diagnostic tree grown on loss landscapeYefan Zhou, Jianlong Chen, Qinxue Cao et al.
This paper considers "model diagnosis", which we formulate as a classification problem. Given a pre-trained neural network (NN), the goal is to predict the source of failure from a set of failure modes (such as a wrong hyperparameter, inadequate model size, and insufficient data) without knowing the training configuration of the pre-trained NN. The conventional diagnosis approach uses training and validation errors to determine whether the model is underfitting or overfitting. However, we show that rich information about NN performance is encoded in the optimization loss landscape, which provides more actionable insights than validation-based measurements. Therefore, we propose a diagnosis method called MD tree based on loss landscape metrics and experimentally demonstrate its advantage over classical validation-based approaches. We verify the effectiveness of MD tree in multiple practical scenarios: (1) use several models trained on one dataset to diagnose a model trained on another dataset, essentially a few-shot dataset transfer problem; (2) use small models (or models trained with small data) to diagnose big models (or models trained with big data), essentially a scale transfer problem. In a dataset transfer task, MD tree achieves an accuracy of 87.7%, outperforming validation-based approaches by 14.88%. Our code is available at https://github.com/YefanZhou/ModelDiagnosis.
AIApr 22, 2025Code
TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem SolvingDaocheng Fu, Jianlong Chen, Renqiu Xia et al.
Mathematical geometric problem solving (GPS) demands verifiable logical coherence and multimodal reasoning capabilities. While large language models (LLMs) have shown rapid progress in GPS, their advancement is hindered by the lack of reliable benchmarks and systematic methodologies. A critical challenge is the inherent hallucination in LLMs, which leads to synthetic GPS datasets that are often noisy, unverified, and self-contradictory. To address this, we introduce TrustGeoGen, a data engine that generates formally verified geometric problems to establish a principled and trustworthy benchmark. Our engine integrates four key innovations: 1) Multimodal Alignment, which synchronizes the generation of diagrams, text, and step-by-step solutions; 2) Formal Verification, ensuring all reasoning paths are rule-compliant; 3) Connection Thinking, bridging formal deduction with human-like logical steps; and 4) our \textit{GeoExplore} series algorithms, which produce diverse problem variants with multiple solutions and self-reflective backtracking. Using this engine, we create the GeoTrust-200K dataset and the corresponding GeoTrust-test benchmark, both with guaranteed cross-modal integrity. Experiments reveal that state-of-the-art models achieve only 45.83\% accuracy on GeoTrust-test, highlighting its significant challenge. Furthermore, training on our synthesized data substantially improves model performance on GPS tasks, with strong generalization to out-of-domain (OOD) benchmarks. Our code and data are available at https://github.com/Alpha-Innovator/TrustGeoGen
LGMar 10
MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-TuningYiyang Lu, Yu He, Jianlong Chen et al.
Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While strong adaptability enables rapid acquisition of new knowledge, it also exposes LLMs to catastrophic forgetting, where previously learned skills degrade during sequential training. Existing replay-based strategies, such as fixed interleaved replay, accuracy-supervised, and loss-driven scheduling, remain limited: some depend on heuristic rules and provide only partial mitigation of forgetting, while others improve performance but incur substantial computational overhead. Motivated by retention dynamics under sequential fine-tuning, we propose Memory-Inspired Sampler and Scheduler Replay (MSSR), an experience replay framework that estimates sample-level memory strength and schedules rehearsal at adaptive intervals to mitigate catastrophic forgetting while maintaining fast adaptation. Extensive experiments across three backbone models and 11 sequential tasks show that MSSR consistently outperforms state-of-the-art replay baselines, with particularly strong gains on reasoning-intensive and multiple-choice benchmarks.
CLApr 26, 2024
Text Sentiment Analysis and Classification Based on Bidirectional Gated Recurrent Units (GRUs) ModelWei Xu, Jianlong Chen, Zhicheng Ding et al.
This paper explores the importance of text sentiment analysis and classification in the field of natural language processing, and proposes a new approach to sentiment analysis and classification based on the bidirectional gated recurrent units (GRUs) model. The study firstly analyses the word cloud model of the text with six sentiment labels, and then carries out data preprocessing, including the steps of removing special symbols, punctuation marks, numbers, stop words and non-alphabetic parts. Subsequently, the data set is divided into training set and test set, and through model training and testing, it is found that the accuracy of the validation set is increased from 85% to 93% with training, which is an increase of 8%; at the same time, the loss value of the validation set decreases from 0.7 to 0.1 and tends to be stable, and the model is gradually close to the actual value, which can effectively classify the text emotions. The confusion matrix shows that the accuracy of the model on the test set reaches 94.8%, the precision is 95.9%, the recall is 99.1%, and the F1 score is 97.4%, which proves that the model has good generalisation ability and classification effect. Overall, the study demonstrated an effective method for text sentiment analysis and classification with satisfactory results.
CVMay 18, 2024
Towards SAR Automatic Target Recognition MultiCategory SAR Image Classification Based on Light Weight Vision TransformerGuibin Zhao, Pengfei Li, Zhibo Zhang et al.
Synthetic Aperture Radar has been extensively used in numerous fields and can gather a wealth of information about the area of interest. This large scene data intensive technology puts a high value on automatic target recognition which can free the utilizers and boost the efficiency. Recent advances in artificial intelligence have made it possible to create a deep learning based SAR ATR that can automatically identify target features from massive input data. In the last 6 years, intensive research has been conducted in this area, however, most papers in the current SAR ATR field used recurrent neural network and convolutional neural network varied models to deepen the regime's understanding of the SAR images. To equip SAR ATR with updated deep learning technology, this paper tries to apply a lightweight vision transformer based model to classify SAR images. The entire structure was verified by an open-accessed SAR data set and recognition results show that the final classification outcomes are robust and more accurate in comparison with referred traditional network structures without even using any convolutional layers.