Yufei Zheng

CV
h-index11
8papers
19citations
Novelty53%
AI Score56

8 Papers

CVMar 23Code
StreamingClaw Technical Report

Jiawei Chen, Zhe Chen, Chaoqun Du et al.

Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck for preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed-loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.

CVMay 25
Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

Yufei Zheng, Xuhan Zhu, Zide Liu et al.

Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ) which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision, rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152{,}776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.

CVMay 12Code
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

Chenfeng Wang, Wei He, Xuhan Zhu et al.

In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.

CVDec 2, 2025Code
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Wei Chen, Chaoqun Du, Feng Gu et al.

We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be recently open-sourced to support the community's development of MLLMs.

QUANT-PHApr 28
Quantum-enhanced Network Tomography

Yufei Zheng, Zihao Gong, Saikat Guha et al.

Network tomography refers to the use of inference techniques for inferring internal network states from end-to-end probes. Quantum probes, implemented by sending blocks of $n$ coherent-state pulses augmented with continuous-variable (CV) squeezing ($n=1$) or weak temporal-mode entanglement ($n>1$) over a lossy channel to a receiver with homodyne detection capabilities, are known to carry information about the channel transmissivity. Assuming a subset of nodes in an optical network is capable of sending and receiving such probes through intermediate nodes with all-optical switching capabilities, we leverage these quantum probes to estimate link transmissivities. To determine how to route the probes in a network, we propose a probe construction algorithm that guarantees link identifiability, while maximizing the number of information orthogonal sets of transmissivities. A set of probes induces a Fisher information matrix (FIM). We then derive two metrics, the determinant of the FIM and the trace of its inverse, to evaluate the performance of the probes. In particular, our results can be used to characterize the quantum improvement in estimating link transmissivities in a general optical network.

CLOct 13, 2025
Do LLMs "Feel"? Emotion Circuits Discovery and Control

Chenxi Wang, Yixuan Zhang, Ruiji Yu et al.

As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer's causal influence on the model's final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.

CVAug 6, 2025
Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise

Yansheng Gao, Yufei Zheng, Shengsheng Wang

Prompt tuning has shown promising results, but its robustness and generalization to unseen categories remain limited. Through our experiments, we demonstrate that the complete removal of semantic noise is a key factor restricting robustness. Existing methods typically suppress or filter out semantic noise in the prompt space, inadvertently hindering the model's robustness and its ability to generalize to unseen categories. To address this, we propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. By clustering weakly perturbed features into noise prompts and integrating them with learnable tokens in both the text and vision encoders, ANPrompt ensures controlled exposure to semantic variations. To enhance the visual pathway, we introduce the Noise-Resistant Visual Prompt Prototype (NRVPP), which stabilizes visual semantics under weak perturbations. Additionally, we propose a Weak Alignment Loss (WALoss) at the logits level to enforce consistency between clean and perturbed predictions, providing stable supervision. By combining weak semantic noise exposure with logits-based consistency, ANPrompt prevents overfitting to specific phrasings while preserving semantic integrity. Extensive experiments across 11 benchmarks, including base-to-new splits, show that ANPrompt consistently outperforms existing prompt tuning methods, offering superior robustness to semantic noise and improved generalization across tasks.

CVJul 11, 2025
Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation for Occluded Person Re-Identification

Yufei Zheng, Wenjun Wang, Wenjun Gan et al.

Occluded person re-identification aims to retrieve holistic images based on occluded ones. Existing methods often rely on aligning visible body parts, applying occlusion augmentation, or complementing missing semantics using holistic images. However, they face challenges in handling diverse occlusion scenarios not seen during training and the issue of feature contamination from holistic images. To address these limitations, we propose Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation (OGFR), which simultaneously mitigates these challenges. OGFR adopts a teacher-student distillation architecture that effectively incorporates diverse occlusion patterns into feature representation while transferring the purified discriminative holistic knowledge from the holistic to the occluded branch through reinforced knowledge distillation. Specifically, an Occlusion-Aware Vision Transformer is designed to leverage learnable occlusion pattern embeddings to explicitly model such diverse occlusion types, thereby guiding occlusion-aware robust feature representation. Moreover, we devise a Feature Erasing and Purification Module within the holistic branch, in which an agent is employed to identify low-quality patch tokens of holistic images that contain noisy negative information via deep reinforcement learning, and substitute these patch tokens with learnable embedding tokens to avoid feature contamination and further excavate identity-related discriminative clues. Afterward, with the assistance of knowledge distillation, the student branch effectively absorbs the purified holistic knowledge to precisely learn robust representation regardless of the interference of occlusions.