99.0CVMar 19Code
AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI AgentsYibo Shi, Jungang Li, Linghao Zhang et al.
Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).
CLDec 29, 2025Code
MiMo-Audio: Audio Language Models are Few-Shot LearnersXiaomi LLM-Core Team, Dong Zhang, Gang Wang et al.
Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million of hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.
15.7NAApr 25
Rank One Completion for Higher Order TensorsLinghao Zhang, Ioana Dumitriu, Jiawang Nie
We study the rank one completion problem for tensors of arbitrary orders. The notion of rank one determinable tensors is introduced. We explore its properties and propose a recursive algorithm for computing rank one tensor completion. This algorithm only requires solving linear systems and computing singular vectors. In the absence of noise, it produces a unique rank one completion under some assumptions. In the presence of noise, we show that the computed rank one tensor completion is close to the exact one when the noise is sufficiently small. Numerical experiments demonstrate the efficiency and accuracy of the proposed method.
84.4CVMar 18
Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language ModelsLinghao Zhang, Jungang Li, Yonghua Hei et al.
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
50.0SEApr 27
CoRaCMG: Contextual Retrieval-Augmented Framework for Commit Message GenerationBo Xiong, Linghao Zhang, Zongen Ren et al.
Commit messages play a key role in documenting the intent behind code changes. However, they are often low-quality, vague, or incomplete, limiting their usefulness. Commit Message Generation (CMG) aims to automatically generate descriptive commit messages from code diffs to reduce developers' effort and improve message quality. Although recent advances in LLMs have shown promise in automating CMG, their performance remains limited. This paper aims to enhance CMG performance by retrieving similar diff-message pairs to guide LLMs to generate commit messages that are more precise and informative. We proposed CoRaCMG, a Contextual Retrieval-augmented framework for Commit Message Generation, structured in three phases: (1) Retrieve: retrieving the similar diff-message pairs; (2) Augment: combining them with the query diff into a structured prompt; and (3) Generate: generating commit messages corresponding to the query diff via LLMs. CoRaCMG enables LLMs to learn project-specific terminologies and writing styles from the retrieved diff-message pairs. We evaluated CoRaCMG across multiple LLMs (e.g., GPT, DeepSeek, and Qwen) and compared its performance against SOTA baselines. Experimental results show that CoRaCMG significantly boosts LLM performance across four metrics (BLEU, Rouge-L, METEOR, and CIDEr). Specifically, DeepSeek-R1 achieves relative improvements of 76% in BLEU and 71% in CIDEr when augmented with a single retrieved example pair. After incorporating the single example pair, GPT-4o achieves the highest improvement rate, with BLEU increasing by 89%. Moreover, performance gains plateau after more than three examples are used, indicating diminishing returns. Further analysis shows that the improvements are attributed to the model's ability to capture the terminologies and writing styles of human-written commit messages from the retrieved example pairs.
CVMay 4, 2025Code
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time VideoShuhang Xun, Sicheng Tao, Jungang Li et al.
Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.
63.9NIMar 17
Agentic AI for SAGIN Resource Management_Semantic Awareness, Orchestration, and OptimizationLinghao Zhang, Haitao Zhao, Bo Xu et al.
Space-air-ground integrated networks (SAGIN) promise ubiquitous 6G connectivity but face significant resource management challenges due to heterogeneous infrastructure, dynamic topologies, and stringent quality-of-service (QoS) requirements. Conventional model-driven approaches struggle with scalability and adaptability in such complex environments. This paper presents an agentic artificial intelligence (AI) framework for autonomous SAGIN resource management by embedding large language model (LLM)-based agents into a Monitor-Analyze-Plan- Execute-Knowledge (MAPE-K) control plane. The framework incorporates three specialized agents, namely semantic resource perceivers, intent-driven orchestrators, and adaptive learners, that collaborate through natural language reasoning to bridge the gap between operator intents and network execution. A key innovation is the hierarchical agent-reinforcement learning (RL) collaboration mechanism, wherein LLM-based orchestrators dynamically shape reward functions for RL agents based on semantic network conditions. Validation through UAV-assisted AIGC service orchestration in energy-constrained scenarios demonstrates that LLM-driven reward shaping achieves 14% energy reduction and the lowest average service latency among all compared methods. This agentic paradigm offers a scalable pathway toward adaptive, AI-native 6G networks, capable of autonomously interpreting intents and adapting to dynamic environments.
51.5AIMar 11
Nurture-First Agent Development: Building Domain-Expert AI Agents Through Conversational Knowledge CrystallizationLinghao Zhang
The emergence of large language model (LLM)-based agent frameworks has shifted the primary challenge in building domain-expert AI agents from raw capability to effective encoding of domain expertise. Two dominant paradigms -- code-first development, which embeds expertise in deterministic pipelines, and prompt-first development, which captures expertise in static system prompts -- both treat agent construction as a discrete engineering phase preceding deployment. We argue that this sequential assumption creates a fundamental mismatch with the nature of domain expertise, which is substantially tacit, deeply personal, and continuously evolving. We propose Nurture-First Development (NFD), a paradigm in which agents are initialized with minimal scaffolding and progressively grown through structured conversational interaction with domain practitioners. The central mechanism is the Knowledge Crystallization Cycle, whereby fragmented knowledge embedded in operational dialogue is periodically consolidated into structured, reusable knowledge assets. We formalize NFD through: (1) a Three-Layer Cognitive Architecture organizing agent knowledge by volatility and personalization degree; (2) the Knowledge Crystallization Cycle with formal definitions of crystallization operations and efficiency metrics; and (3) an operational framework comprising a Dual-Workspace Pattern and Spiral Development Model. We illustrate the paradigm through a detailed case study on building a financial research agent for U.S. equity analysis and discuss the conditions, limitations, and broader implications of NFD for human-agent co-evolution.
SEMay 29, 2025
SWE-bench Goes Live!Linghao Zhang, Shilin He, Chaoyun Zhang et al.
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
CLJan 23, 2025
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at ScaleLinghao Zhang, Junhao Wang, Shilin He et al.
Large Language Models have advanced automated software development, however, it remains a challenge to correctly infer dependencies, namely, identifying the internal components and external packages required for a repository to successfully run. Existing studies highlight that dependency-related issues cause over 40\% of observed runtime errors on the generated repository. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
SEMar 5
RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY PlatformKenan Li, Rongzhi Li, Linghao Zhang et al.
Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems. To demonstrate its utility, we further propose a fully automated pipeline for SWE dataset creation, where task design is the only human intervention. RepoLaunch automates the remaining steps, enabling scalable benchmarking and training of coding agents and LLMs. Notably, several works on agentic benchmarking and training have recently adopted RepoLaunch for automated task generation.
DCJan 20
Device Association and Resource Allocation for Hierarchical Split Federated Learning in Space-Air-Ground Integrated NetworkHaitao Zhao, Xiaoyu Tang, Bo Xu et al.
6G facilitates deployment of Federated Learning (FL) in the Space-Air-Ground Integrated Network (SAGIN), yet FL confronts challenges such as resource constrained and unbalanced data distribution. To address these issues, this paper proposes a Hierarchical Split Federated Learning (HSFL) framework and derives its upper bound of loss function. To minimize the weighted sum of training loss and latency, we formulate a joint optimization problem that integrates device association, model split layer selection, and resource allocation. We decompose the original problem into several subproblems, where an iterative optimization algorithm for device association and resource allocation based on brute-force split point search is proposed. Simulation results demonstrate that the proposed algorithm can effectively balance training efficiency and model accuracy for FL in SAGIN.
CYNov 26, 2025
AI Urban Scientist: Multi-Agent Collaborative Automation for Urban ResearchTong Xia, Jiankun Zhang, Ruiwen You et al.
Urban research aims to understand how cities operate and evolve as complex adaptive systems. With the rapid growth of urban data and analytical methodologies, the central challenge of the field has shifted from data availability to the integration of heterogeneous data into coherent, verifiable urban knowledge through multidisciplinary approaches. Recent advances in AI, particularly the emergence of large language models (LLMs), have enabled the development of AI scientists capable of autonomous reasoning, hypothesis generation, and data-driven experimentation, demonstrating substantial potential for autonomous urban research. However, most general-purpose AI systems remain misaligned with the domain-specific knowledge, methodological conventions, and inferential standards required in urban studies. Here, we introduce the AI Urban Scientist, a knowledge-driven multi-agent framework designed to support autonomous urban research. Grounded in hypotheses, peer-review feedback, datasets, and research methodologies distilled from large-scale prior studies, the system constructs structured domain knowledge that guides LLM-based agents to automatically generate hypotheses, identify and integrate multi-source urban datasets, conduct empirical analyses and simulations, and iteratively refine analytical methods. Through this process, the framework synthesizes new insights in urban science and accelerates the urban research lifecycle.
AIAug 23, 2025
Solving the Min-Max Multiple Traveling Salesmen Problem via Learning-Based Path Generation and Optimal SplittingWen Wang, Xiangchen Wu, Liang Wang et al.
This study addresses the Min-Max Multiple Traveling Salesmen Problem ($m^3$-TSP), which aims to coordinate tours for multiple salesmen such that the length of the longest tour is minimized. Due to its NP-hard nature, exact solvers become impractical under the assumption that $P \ne NP$. As a result, learning-based approaches have gained traction for their ability to rapidly generate high-quality approximate solutions. Among these, two-stage methods combine learning-based components with classical solvers, simplifying the learning objective. However, this decoupling often disrupts consistent optimization, potentially degrading solution quality. To address this issue, we propose a novel two-stage framework named \textbf{Generate-and-Split} (GaS), which integrates reinforcement learning (RL) with an optimal splitting algorithm in a joint training process. The splitting algorithm offers near-linear scalability with respect to the number of cities and guarantees optimal splitting in Euclidean space for any given path. To facilitate the joint optimization of the RL component with the algorithm, we adopt an LSTM-enhanced model architecture to address partial observability. Extensive experiments show that the proposed GaS framework significantly outperforms existing learning-based approaches in both solution quality and transferability.
CVJul 23, 2018
Question Relevance in Visual Question AnsweringPrakruthi Prabhakar, Nitish Kulkarni, Linghao Zhang
Free-form and open-ended Visual Question Answering systems solve the problem of providing an accurate natural language answer to a question pertaining to an image. Current VQA systems do not evaluate if the posed question is relevant to the input image and hence provide nonsensical answers when posed with irrelevant questions to an image. In this paper, we solve the problem of identifying the relevance of the posed question to an image. We address the problem as two sub-problems. We first identify if the question is visual or not. If the question is visual, we then determine if it's relevant to the image or not. For the second problem, we generate a large dataset from existing visual question answering datasets in order to enable the training of complex architectures and model the relevance of a visual question to an image. We also compare the results of our Long Short-Term Memory Recurrent Neural Network based models to Logistic Regression, XGBoost and multi-layer perceptron based approaches to the problem.