h-index22
18papers
395citations
Novelty53%
AI Score59

18 Papers

96.8CVMay 26Code
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

Yifan Sui, Xin Huang, Hongbing Li et al.

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37\% agreement with human evaluators. The strongest model reaches a 62.0\% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

CVJan 14Code
STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han et al.

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

SYNov 9, 2017
Coordinated trajectory tracking of multiple vertical take-off and landing UAVs

Yao Zou, Ziyang Meng

This paper investigates the coordinated trajectory tracking problem of multiple vertical takeooff and landing (VTOL) unmanned aerial vehicles (UAVs). The case of unidirectional information flow is considered and the objective is to drive all the follower VTOL UAVs to accurately track the trajectory of the leader. Firstly, a novel distributed estimator is developed for each VTOL UAV to obtain the leader's desired information asymptotically. With the outputs of the estimators, the solution to the coordinated trajectory tracking problem of multiple VTOL UAVs is transformed to individually solving the tracking problem of each VTOL UAV. Due to the under-actuated nature of the VTOL UAV, a hierarchical framework is introduced for each VTOL UAV such that a command force and an applied torque are exploited in sequence, then the position tracking to the estimated desired position and the attitude tracking to the command attitude are achieved. Moreover, an auxiliary system with proper parameters is implemented to guarantee the singularity-free command attitude extraction and to obviate the use of the unavailable desired information. The stability analysis and simulations effectively validate the achievement of the coordinated trajectory tracking of multiple VTOL UAVs with the proposed control approach.

SYMay 12, 2017
A State Transition Matrix-Based Approach to Separation of Cooperations and Antagonisms in Opinion Dynamics

Deyuan Meng, Ziyang Meng, Yiguang Hong

This paper is concerned with the dynamics evolution of opinions in the presence of both cooperations and antagonisms. The class of Laplacian flows is addressed through signed digraphs subject to switching topologies. Further, a state transition matrix-based approach is developed for the analysis of opinion dynamics, regardless of any assumptions on connectivity, structural balance or digon sign-symmetry of signed digraphs. It is shown that based on the separation of cooperations and antagonisms, a relationship can be bridged between opinion dynamics under signed digraphs and under conventional digraphs. This helps to solve convergence problems for opinion dynamics. In particular, bipartite consensus (or stability) emerges if and only if the associated switching signed digraph is simultaneously structurally balanced (or unbalanced), which generalizes the use of structural balance theory in opinion dynamics to the case study of changing network topologies.

AIFeb 9Code
GEBench: Benchmarking Image Generation Models as GUI Environments

Haodong Li, Jingwei Wu, Quan Sun et al.

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.

CVJan 26Code
HomoFM: Deep Homography Estimation with Flow Matching

Mengfan He, Liangzheng Sun, Chunyu Li et al.

Deep homography estimation has broad applications in computer vision and robotics. Remarkable progresses have been achieved while the existing methods typically treat it as a direct regression or iterative refinement problem and often struggling to capture complex geometric transformations or generalize across different domains. In this work, we propose HomoFM, a new framework that introduces the flow matching technique from generative modeling into the homography estimation task for the first time. Unlike the existing methods, we formulate homography estimation problem as a velocity field learning problem. By modeling a continuous and point-wise velocity field that transforms noisy distributions into registered coordinates, the proposed network recovers high-precision transformations through a conditional flow trajectory. Furthermore, to address the challenge of domain shifts issue, e.g., the cases of multimodal matching or varying illumination scenarios, we integrate a gradient reversal layer (GRL) into the feature extraction backbone. This domain adaptation strategy explicitly constrains the encoder to learn domain-invariant representations, significantly enhancing the network's robustness. Extensive experiments demonstrate the effectiveness of the proposed method, showing that HomoFM outperforms state-of-the-art methods in both estimation accuracy and robustness on standard benchmarks. Code and data resource are available at https://github.com/hmf21/HomoFM.

CVFeb 11Code
(MGS)$^2$-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization

Minglei Li, Mengfan He, Chao Chen et al.

Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS)$^2$, a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS)$^2$ achieves state-of-the-art performance, recording a Recall@1 of 97.5\% on University-1652 and 97.02\% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: \href{https://github.com/GabrielLi1473/MGS-Net}{https://github.com/GabrielLi1473/MGS-Net}.

CVAug 14, 2025Code
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

NextStep Team, Chunrui Han, Guopeng Li et al. · tsinghua

Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.

48.6CVMar 10
RAE-NWM: Navigation World Model in Dense Visual Representation Space

Mingkun Zhang, Wangtian Shen, Fan Zhang et al.

Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.

39.7CVMar 13Code
CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images

Liangzheng Sun, Mengfan He, Xingyu Shao et al.

Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation and perception. Along with the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, crossmodal feature matching is still a challenging task due to the significant appearance difference. A significant gap for cross-modal feature matching research lies in the absence of standardized benchmarks and metrics for evaluations. In this paper, we introduce a comprehensive cross-modal feature matching benchmark, CM-Bench, which encompasses 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semidense, and dense methods. These methods are evaluated by different tasks including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resource will be available at: https://github.com/SLZ98/CM-Bench.

CVApr 4, 2025
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Kaixin Li, Ziyang Meng, Hongzhan Lin et al.

Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.

43.7ROApr 10
Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments

Xingyu Shao, Zhiqiang Yan, Liangzheng Sun et al.

Robust geo-localization in changing environmental conditions is critical for long-term aerial autonomy. While visual place recognition (VPR) models perform well when airborne views match the training domain, adapting them to shifting distributions during sequential missions triggers catastrophic forgetting. Existing continual learning (CL) methods often fail here because geographic features exhibit severe intra-class variations. In this work, we formulate aerial VPR as a mission-based domain-incremental learning (DIL) problem and propose a novel heterogeneous memory framework. To respect strict onboard storage constraints, our "Learn-and-Dispose" pipeline decouples geographic knowledge into static satellite anchors (preserving global geometric priors) and a dynamic experience replay buffer (retaining domain-specific features). We introduce a spatially-constrained allocation strategy that optimizes buffer selection based on sample difficulty or feature space diversity. To facilitate systematic assessment, we provide three evaluation criteria and a comprehensive benchmark derived from 21 diverse mission sequences. Extensive experiments demonstrate that our architecture significantly boosts spatial generalization; our diversity-driven buffer selection outperforms the random baseline by 7.8% in knowledge retention. Unlike class-mean preservation methods that fail in unstructured environments, maximizing structural diversity achieves a superior plasticity-stability balance and ensures order-agnostic robustness across randomized sequences. These results prove that maintaining structural feature coverage is more critical than sample difficulty for resolving catastrophic forgetting in lifelong aerial autonomy.

CVDec 17, 2025
Step-GUI Technical Report

Haolong Yan, Jia Wang, Xin Huang et al.

Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.

ROSep 17, 2025
Prompt2Auto: From Motion Prompt to Automated Control via Geometry-Invariant One-Shot Gaussian Process Learning

Zewen Yang, Xiaobing Dai, Dongfa Zhang et al.

Learning from demonstration allows robots to acquire complex skills from human demonstrations, but conventional approaches often require large datasets and fail to generalize across coordinate transformations. In this paper, we propose Prompt2Auto, a geometry-invariant one-shot Gaussian process (GeoGP) learning framework that enables robots to perform human-guided automated control from a single motion prompt. A dataset-construction strategy based on coordinate transformations is introduced that enforces invariance to translation, rotation, and scaling, while supporting multi-step predictions. Moreover, GeoGP is robust to variations in the user's motion prompt and supports multi-skill autonomy. We validate the proposed approach through numerical simulations with the designed user graphical interface and two real-world robotic experiments, which demonstrate that the proposed method is effective, generalizes across tasks, and significantly reduces the demonstration burden. Project page is available at: https://prompt2auto.github.io

CVJun 20, 2024
VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning

Ziyang Meng, Yu Dai, Zezheng Gong et al.

Recent advances in Large Vision-Language Models (LVLMs) have significantly improve performance in image comprehension tasks, such as formatted charts and rich-content images. Yet, Graphical User Interface (GUI) pose a greater challenge due to their structured format and detailed textual information. Existing LVLMs often overly depend on internal knowledge and neglect image content, resulting in hallucinations and incorrect responses in GUI comprehension. To address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to enhance the interpretation of visual data of GUI and reduce hallucinations. We first construct a Vision Question Answering (VQA) dataset of 63.8k high-quality examples with our propose Referent Method, which ensures the model's responses are highly depend on visual content within the image. We then design a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to enhance both the model's ability to extract information from image content and alignment with human intent. Experiments show that our approach enhances the model's ability to extract information from images and achieves state-of-the-art results in GUI understanding tasks. Our dataset and fine-tuning script will be released soon.

SYJun 4, 2017
Persistent Flows and Non-Reciprocal Interactions in Deterministic Networks

Weiguo Xia, Guodong Shi, Ziyang Meng et al.

This paper studies deterministic consensus networks with discrete-time dynamics under persistent flows and non-reciprocal agent interactions. An arc describing the interaction strength between two agents is said to be persistent if its weight function has an infinite $l_1$ norm. We discuss two balance conditions on the interactions between agents which generalize the arc-balance and cut-balance conditions in the literature respectively. The proposed conditions require that such a balance should be satisfied over each time window of a fixed length instead of at each time instant. We prove that in both cases global consensus is reached if and only if the persistent graph, which consists of all the persistent arcs, contains a directed spanning tree. The convergence rates of the system to consensus are also provided in terms of the interactions between agents having taken place. The results are obtained under a weak condition without assuming the existence of a positive lower bound of all the nonzero weights of arcs and are compared with the existing results. Illustrative examples are provided to show the critical importance of the nontrivial lower boundedness of the self-confidence of the agents.

SYAug 24, 2015
Network Synchronization with Nonlinear Dynamics and Switching Interactions

Tao Yang, Ziyang Meng, Guodong Shi et al.

This paper considers the synchronization problem for networks of coupled nonlinear dynamical systems under switching communication topologies. Two types of nonlinear agent dynamics are considered. The first one is non-expansive dynamics (stable dynamics with a convex Lyapunov function $φ(\cdot)$) and the second one is dynamics that satisfies a global Lipschitz condition. For the non-expansive case, we show that various forms of joint connectivity for communication graphs are sufficient for networks to achieve global asymptotic $φ$-synchronization. We also show that $φ$-synchronization leads to state synchronization provided that certain additional conditions are satisfied. For the globally Lipschitz case, unlike the non-expansive case, joint connectivity alone is not sufficient for achieving synchronization. A sufficient condition for reaching global exponential synchronization is established in terms of the relationship between the global Lipschitz constant and the network parameters. We also extend the results to leader-follower networks.

SYMar 23, 2015
Sampled-Data Consensus over Random Networks

Junfeng Wu, Ziyang Meng, Tao Yang et al.

This paper considers the consensus problem for a network of nodes with random interactions and sampled-data control actions. We first show that consensus in expectation, in mean square, and almost surely are equivalent for a general random network model when the inter-sampling interval and network size satisfy a simple relation. The three types of consensus are shown to be simultaneously achieved over an independent or a Markovian random network defined on an underlying graph with a directed spanning tree. For both independent and Markovian random network models, necessary and sufficient conditions for mean-square consensus are derived in terms of the spectral radius of the corresponding state transition matrix. These conditions are then interpreted as the existence of critical value on the inter-sampling interval, below which global mean-square consensus is achieved and above which the system diverges in mean-square sense for some initial states. Finally, we establish an upper bound on the inter-sampling interval below which almost sure consensus is reached, and a lower bound on the inter-sampling interval above which almost sure divergence is reached. Some numerical simulations are given to validate the theoretical results and some discussions on the critical value of the inter-sampling intervals for the mean-square consensus are provided.