80.6ROJun 4
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware UnderstandingQize Yu, Jiadi You, Yuran Wang et al.
Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception--action mappings. To address this challenge, we propose \textbf{AffordanceVLA}, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception--action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) \textbf{Which2Act} for object-centric grounding via visual latent prediction to suppress distractions; 2) \textbf{Where2Act} for 2D interaction localization via affordance map estimation; and 3) \textbf{How2Act} for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments on simulation and real-world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
99.9SEMar 26Code
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web TestingFanheng Kong, Jingyuan Zhang, Yang Yue et al.
The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
LGNov 28, 2023
Attentional Graph Neural Network Is All You Need for Robust Massive Network LocalizationWenzhong Yan, Feng Yin, Juntao Wang et al.
In this paper, we design Graph Neural Networks (GNNs) with attention mechanisms to tackle an important yet challenging nonlinear regression problem: massive network localization. We first review our previous network localization method based on Graph Convolutional Network (GCN), which can exhibit state-of-the-art localization accuracy, even under severe Non-Line-of-Sight (NLOS) conditions, by carefully preselecting a constant threshold for determining adjacency. As an extension, we propose a specially designed Attentional GNN (AGNN) model to resolve the sensitive thresholding issue of the GCN-based method and enhance the underlying model capacity. The AGNN comprises an Adjacency Learning Module (ALM) and Multiple Graph Attention Layers (MGAL), employing distinct attention architectures to systematically address the demerits of the GCN-based method, rendering it more practical for real-world applications. Comprehensive analyses are conducted to explain the superior performance of these methods, including a theoretical analysis of the AGNN's dynamic attention property and computational complexity, along with a systematic discussion of their robust characteristic against NLOS measurements. Extensive experimental results demonstrate the effectiveness of the GCN-based and AGNN-based network localization methods. Notably, integrating attention mechanisms into the AGNN yields substantial improvements in localization accuracy, approaching the fundamental lower bound and showing approximately 37\% to 53\% reduction in localization error compared to the vanilla GCN-based method across various NLOS noise configurations. Both methods outperform all competing approaches by far in terms of localization accuracy, robustness, and computational time, especially for considerably large network sizes.
LGJan 24, 2023
Koopman neural operator as a mesh-free solver of non-linear partial differential equationsWei Xiong, Xiaomeng Huang, Ziyang Zhang et al.
The lacking of analytic solutions of diverse partial differential equations (PDEs) gives birth to a series of computational techniques for numerical solutions. Although numerous latest advances are accomplished in developing neural operators, a kind of neural-network-based PDE solver, these solvers become less accurate and explainable while learning long-term behaviors of non-linear PDE families. In this paper, we propose the Koopman neural operator (KNO), a new neural operator, to overcome these challenges. With the same objective of learning an infinite-dimensional mapping between Banach spaces that serves as the solution operator of the target PDE family, our approach differs from existing models by formulating a non-linear dynamic system of equation solution. By approximating the Koopman operator, an infinite-dimensional operator governing all possible observations of the dynamic system, to act on the flow mapping of the dynamic system, we can equivalently learn the solution of a non-linear PDE family by solving simple linear prediction problems. We validate the KNO in mesh-independent, long-term, and5zero-shot predictions on five representative PDEs (e.g., the Navier-Stokes equation and the Rayleigh-B{é}nard convection) and three real dynamic systems (e.g., global water vapor patterns and western boundary currents). In these experiments, the KNO exhibits notable advantages compared with previous state-of-the-art models, suggesting the potential of the KNO in supporting diverse science and engineering applications (e.g., PDE solving, turbulence modelling, and precipitation forecasting).
CVAug 19, 2022
Video Interpolation by Event-driven Anisotropic Adjustment of Optical FlowSong Wu, Kaichao You, Weihua He et al.
Video frame interpolation is a challenging task due to the ever-changing real-world scene. Previous methods often calculate the bi-directional optical flows and then predict the intermediate optical flows under the linear motion assumptions, leading to isotropic intermediate flow generation. Follow-up research obtained anisotropic adjustment through estimated higher-order motion information with extra frames. Based on the motion assumptions, their methods are hard to model the complicated motion in real scenes. In this paper, we propose an end-to-end training method A^2OF for video frame interpolation with event-driven Anisotropic Adjustment of Optical Flows. Specifically, we use events to generate optical flow distribution masks for the intermediate optical flow, which can model the complicated motion between two frames. Our proposed method outperforms the previous methods in video frame interpolation, taking supervised event-based video interpolation to a higher stage.
LGJan 3, 2023
KoopmanLab: machine learning for solving complex physics equationsWei Xiong, Muyuan Ma, Xiaomeng Huang et al.
Numerous physics theories are rooted in partial differential equations (PDEs). However, the increasingly intricate physics equations, especially those that lack analytic solutions or closed forms, have impeded the further development of physics. Computationally solving PDEs by classic numerical approaches suffers from the trade-off between accuracy and efficiency and is not applicable to the empirical data generated by unknown latent PDEs. To overcome this challenge, we present KoopmanLab, an efficient module of the Koopman neural operator family, for learning PDEs without analytic solutions or closed forms. Our module consists of multiple variants of the Koopman neural operator (KNO), a kind of mesh-independent neural-network-based PDE solvers developed following dynamic system theory. The compact variants of KNO can accurately solve PDEs with small model sizes while the large variants of KNO are more competitive in predicting highly complicated dynamic systems govern by unknown, high-dimensional, and non-linear PDEs. All variants are validated by mesh-independent and long-term prediction experiments implemented on representative PDEs (e.g., the Navier-Stokes equation and the Bateman-Burgers equation in fluid mechanics) and ERA5 (i.e., one of the largest high-resolution global-scale climate data sets in earth physics). These demonstrations suggest the potential of KoopmanLab to be a fundamental tool in diverse physics studies related to equations or dynamic systems.
72.9DCMay 6
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model TrainingYida Gu, Fakang Wang, Jianhao Fu et al.
As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes-substantially outperforming existing solutions.
ROSep 14, 2024Code
PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLMKelin Fu, Yang Tian, Kaigui Bian
Smartphones have significantly enhanced our daily learning, communication, and entertainment, becoming an essential component of modern life. However, certain populations, including the elderly and individuals with disabilities, encounter challenges in utilizing smartphones, thus necessitating mobile app operation assistants, a.k.a. mobile app agent. With considerations for privacy, permissions, and cross-platform compatibility issues, we endeavor to devise and develop PeriGuru in this work, a peripheral robotic mobile app operation assistant based on GUI image understanding and prompting with Large Language Model (LLM). PeriGuru leverages a suite of computer vision techniques to analyze GUI screenshot images and employs LLM to inform action decisions, which are then executed by robotic arms. PeriGuru achieves a success rate of 81.94% on the test task set, which surpasses by more than double the method without PeriGuru's GUI image interpreting and prompting design. Our code is available on https://github.com/Z2sJ4t/PeriGuru.
LGNov 27, 2023
GLIME: General, Stable and Local LIME ExplanationZeren Tan, Yang Tian, Jian Li
As black-box machine learning models grow in complexity and find applications in high-stakes scenarios, it is imperative to provide explanations for their predictions. Although Local Interpretable Model-agnostic Explanations (LIME) [22] is a widely adpoted method for understanding model behaviors, it is unstable with respect to random seeds [35,24,3] and exhibits low local fidelity (i.e., how well the explanation approximates the model's local behaviors) [21,16]. Our study shows that this instability problem stems from small sample weights, leading to the dominance of regularization and slow convergence. Additionally, LIME's sampling neighborhood is non-local and biased towards the reference, resulting in poor local fidelity and sensitivity to reference choice. To tackle these challenges, we introduce GLIME, an enhanced framework extending LIME and unifying several prior methods. Within the GLIME framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, GLIME generates explanations with higher local fidelity compared to LIME. GLIME explanations are independent of reference choice. Moreover, GLIME offers users the flexibility to choose a sampling distribution based on their specific scenarios.
LGDec 4, 2022
Statistical Physics of Deep Neural Networks: Initialization toward Optimal ChannelsKangyu Weng, Aohua Cheng, Ziyang Zhang et al.
In deep learning, neural networks serve as noisy channels between input data and its representation. This perspective naturally relates deep learning with the pursuit of constructing channels with optimal performance in information transmission and representation. While considerable efforts are concentrated on realizing optimal channel properties during network optimization, we study a frequently overlooked possibility that neural networks can be initialized toward optimal channels. Our theory, consistent with experimental validation, identifies primary mechanics underlying this unknown possibility and suggests intrinsic connections between statistical physics and deep learning. Unlike the conventional theories that characterize neural networks applying the classic mean-filed approximation, we offer analytic proof that this extensively applied simplification scheme is not valid in studying neural networks as information channels. To fill this gap, we develop a corrected mean-field framework applicable for characterizing the limiting behaviors of information propagation in neural networks without strong assumptions on inputs. Based on it, we propose an analytic theory to prove that mutual information maximization is realized between inputs and propagated signals when neural networks are initialized at dynamic isometry, a case where information transmits via norm-preserving mappings. These theoretical predictions are validated by experiments on real neural networks, suggesting the robustness of our theory against finite-size effects. Finally, we analyze our findings with information bottleneck theory to confirm the precise relations among dynamic isometry, mutual information maximization, and optimal channel properties in deep learning.
STAT-MECHApr 21, 2022
A unified theory of information transfer and causal relationYang Tian, Hedong Hou, Yaoyuan Wang et al.
Information transfer between coupled stochastic dynamics, measured by transfer entropy and information flow, is suggested as a physical process underlying the causal relation of systems. While information transfer analysis has booming applications in both science and engineering fields, critical mysteries about its foundations remain unsolved. Fundamental yet difficult questions concern how information transfer and causal relation originate, what they depend on, how they differ from each other, and if they are created by a unified and general quantity. These questions essentially determine the validity of causal relation measurement via information transfer. Here we pursue to lay a complete theoretical basis of information transfer and causal relation. Beyond the well-known relations between these concepts that conditionally hold, we demonstrate that information transfer and causal relation universally originate from specific information synergy and redundancy phenomena characterized by high-order mutual information. More importantly, our theory analytically explains the mechanisms for information transfer and causal relation to originate, vanish, and differ from each other. Moreover, our theory naturally defines the effect sizes of information transfer and causal relation based on high-dimensional coupling events. These results may provide a unified view of information, synergy, and causal relation to bridge Pearl's causal inference theory in computer science and information transfer analysis in physics.
CLMay 19, 2025Code
LEXam: Benchmarking Legal Reasoning on 340 Law ExamsYu Fan, Jingwei Ni, Jakob Merane et al. · eth-zurich
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce \textsc{LEXam}, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation on both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. We have open-sourced our code on https://github.com/LEXam-Benchmark/LEXam and released our data on https://huggingface.co/datasets/LEXam-Benchmark/LEXam. Project page: https://lexam-benchmark.github.io.
ROOct 15, 2025Code
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot PolicyXinyi Chen, Yilun Chen, Yanwei Fu et al.
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine ``where to act'' by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide ``how to act'' by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.
70.4ROMar 28
Design of an In-Pipe Robot with Contact-Angle-Guided Kinematic Decoupling for Crosstalk-Suppressed LocomotionMin Yang, Yang Tian, Longchuang Li et al.
In-pipe inspection robots must traverse confined pipeline networks with elbows and three-dimensional fittings, requiring both reliable axial traction and rapid rolling reorientation for posture correction. In compact V-shaped platforms, these functions often rely on shared contacts or indirect actuation, which introduces strong kinematic coupling and makes performance sensitive to geometry and friction variations. This paper presents a V-shaped in-pipe robot with a joint-axis-and-wheel-separation layout that provides two physically independent actuation channels, with all-wheel-drive propulsion and motorized rolling reorientation while using only two motors. To make the decoupling mechanism explicit and designable, we formulate an actuation transmission matrix and identify the spherical-wheel contact angle as the key geometric variable governing the dominant roll-to-propulsion leakage and roll-channel efficiency. A geometric transmission analysis maps mounting parameters to the contact angle, leakage, and efficiency, yielding a structural guideline for suppressing crosstalk by driving the contact angle toward zero. A static stability model further provides a stability-domain map for selecting torsion-spring stiffness under friction uncertainty to ensure vertical-pipe stability with a margin. Experiments validate the decoupling effect, where during high-dynamic rolling in a vertical pipe, the propulsion torque remains nearly invariant. On a multi-material testbed including out-of-plane double elbows, the robot achieved a 100% success rate in more than 10 independent round-trip trials.
AINov 1, 2025
GraphChain: Large Language Models for Large-scale Graph Analysis via Tool ChainingChunyu Wei, Wenji Hu, Xingjia Hao et al.
Large Language Models (LLMs) face significant limitations when applied to large-scale graphs, struggling with context constraints and inflexible reasoning. We present GraphChain, a framework that enables LLMs to analyze complex graphs through dynamic sequences of specialized tools, mimicking human exploratory intelligence. Our approach introduces two key innovations: (1) Progressive Graph Distillation, a reinforcement learning mechanism that generates optimized tool sequences balancing task relevance with information compression, and (2) Structure-aware Test-Time Adaptation, which efficiently tailors tool selection strategies to diverse graph topologies using spectral properties and lightweight adapters without costly retraining. Experiments show GraphChain significantly outperforms prior methods, enabling scalable and adaptive LLM-driven graph analysis.
93.5LGMay 9
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement LearningYuhang Xu, Kaibin Tian, Yang Tian et al.
Reinforcement Learning (RL) has become a cornerstone for improving the performance of Large Language Models (LLMs). However, its rollout phase constitutes a significant efficiency bottleneck, mainly arising from the long-tail bubbles across data parallel ranks, particularly in long-context scenarios where faster GPUs remain idle while waiting for stragglers. Existing solutions, such as partial rollout or asynchronous RL, mitigate these bubbles by compromising the algorithm's strict synchronous nature. Instead, we propose BubbleSpec, a novel framework that accelerates RL rollouts while strictly keeping the mathematical exactness. Instead of attempting to eliminate bubbles, BubbleSpec exploits them. We exploit the idle time windows of faster ranks to pre-generate rollout results for subsequent steps, serving as drafts for speculative decoding. Unlike prior speculative methods that rely on historical epoch similarity and warm-ups, BubbleSpec is agnostic to dataset size and provides immediate acceleration from the onset of training. Extensive evaluations demonstrate that BubbleSpec reduces decoding steps by 50% and increases rollout throughput by up to 1.8x. Critically, BubbleSpec is seamlessly compatible with various RL frameworks and strategies as it sustains the strict synchronous property of RL algorithms.
CVMar 24, 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video UnderstandingXiangrui Liu, Yan Shu, Zheng Liu et al.
Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships, which are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLMs fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple yet Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.
LGNov 11, 2025
Improving the accuracy and generalizability of molecular property regression models with a substructure-substitution-rule-informed frameworkXiaoyu Fan, Lin Guo, Ruizhen Jia et al.
Artificial Intelligence (AI)-aided drug discovery is an active research field, yet AI models often exhibit poor accuracy in regression tasks for molecular property prediction, and perform catastrophically poorly for out-of-distribution (OOD) molecules. Here, we present MolRuleLoss, a substructure-substitution-rule-informed framework that improves the accuracy and generalizability of multiple molecular property regression models (MPRMs) such as GEM and UniMol for diverse molecular property prediction tasks. MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into an MPRM's loss function. When using GEM models for predicting lipophilicity, water solubility, and solvation-free energy (using lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), the root mean squared error (RMSE) values with and without MolRuleLoss were 0.587 vs. 0.660, 0.777 vs. 0.798, and 1.252 vs. 1.877, respectively, representing 2.6-33.3% performance improvements. We show that both the number and the quality of SSRs contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. MolRuleLoss improved the generalizability of MPRMs for "activity cliff" molecules in a lipophilicity prediction task and improved the generalizability of MPRMs for OOD molecules in a melting point prediction task. In a molecular weight prediction task for OOD molecules, MolRuleLoss reduced the RMSE value of a GEM model from 29.507 to 0.007. We also provide a formal demonstration that the upper bound of the variation for property change of SSRs is positively correlated with an MPRM's error. Together, we show that using the MolRuleLoss framework as a bolt-on boosts the prediction accuracy and generalizability of multiple MPRMs, supporting diverse applications in areas like cheminformatics and AI-aided drug discovery.
ROJun 24, 2025
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action ModelingHao Li, Shuai Yang, Yilun Chen et al.
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
CLJun 3, 2025
CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAGYang Tian, Fan Liu, Jingyuan Zhang et al.
Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose Cross-source knowledge \textbf{Re}conciliation for Multimodal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, achieving 5.6% and 9.3% performance gains on InfoSeek and Encyclopedic-VQA, respectively.
CVJan 21, 2025
Video Deblurring by Sharpness Prior Detection and Edge InformationYang Tian, Fabio Brau, Giulio Rossolini et al.
Video deblurring is essential task for autonomous driving, facial recognition, and security surveillance. Traditional methods directly estimate motion blur kernels, often introducing artifacts and leading to poor results. Recent approaches utilize the detection of sharp frames within video sequences to enhance deblurring. However, existing datasets rely on fixed number of sharp frames, which may be too restrictive for some applications and may introduce a bias during model training. To address these limitations and enhance domain adaptability, this work first introduces GoPro Random Sharp (GoProRS), a new dataset where the the frequency of sharp frames within the sequence is customizable, allowing more diverse training and testing scenarios. Furthermore, it presents a novel video deblurring model, called SPEINet, that integrates sharp frame features into blurry frame reconstruction through an attention-based encoder-decoder architecture, a lightweight yet robust sharp frame detection and an edge extraction phase. Extensive experimental results demonstrate that SPEINet outperforms state-of-the-art methods across multiple datasets, achieving an average of +3.2% PSNR improvement over recent techniques. Given such promising results, we believe that both the proposed model and dataset pave the way for future advancements in video deblurring based on the detection of sharp frames.
96.0ROApr 9
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable WorldsYunsong Zhou, Hangxu Liu, Xuekun Jiang et al.
Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition, prevailing sim-to-real pipelines remain rooted in rigid-body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics-aligned real-to-sim-to-real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric-consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion-based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near-demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real-data baselines at a 1:15 equivalence ratio, while delivering 90% zero-shot success and 50% generalization gains in real-world deployment. These results validate physics-aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data-efficient policy learning.
CLMar 21, 2025
MMCR: Benchmarking Cross-Source Reasoning in Scientific PapersYang Tian, Zheng Lu, Mingqi Gao et al.
Fully comprehending scientific papers by machines reflects a high level of Artificial General Intelligence, requiring the ability to reason across fragmented and heterogeneous sources of information, presenting a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning with evidence source from single image or text page, their ability to use cross-source information for reasoning remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, with only 20% accuracy in multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. Furthermore, we investigated the impact of the Chain-of-Thought (CoT) technique on cross-source reasoning and observed a detrimental effect on small models, whereas larger models demonstrated substantially enhanced performance. These results highlight the pressing need to develop VLMs capable of effectively utilizing cross-source information for reasoning.
CVOct 26, 2025
Open Multimodal Retrieval-Augmented Factual Image GenerationYang Tian, Fan Liu, Jingyuan Zhang et al.
Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
CVSep 30, 2025
TimeScope: Towards Task-Oriented Temporal Grounding In Long VideosXiangrui Liu, Minghao Qin, Yan Shu et al.
Identifying key moments in long videos is essential for downstream understanding and reasoning tasks. In this paper, we introduce a new problem, Taskoriented Temporal Grounding ToTG, which aims to localize time intervals containing the necessary information based on a task's natural description. Along with the definition, we also present ToTG Bench, a comprehensive benchmark for evaluating the performance on ToTG. ToTG is particularly challenging for traditional approaches due to their limited generalizability and difficulty in handling long videos. To address these challenges, we propose TimeScope, a novel framework built upon progressive reasoning. TimeScope first identifies a coarse-grained temporal scope in the long video that likely contains the key moments, and then refines this scope through finegrained moment partitioning. Additionally, we curate a highquality dataset, namely ToTG Pile, to enhance TimeScope's ability to perform progressive temporal grounding effectively. Extensive experiments demonstrate that TimeScope consistently outperforms both existing temporalgrounding methods and popular MLLMs across various settings, highlighting its effectiveness in addressing this new challenging problem.
ROJul 23, 2025
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to ManipulationShuai Yang, Hao Li, Yilun Chen et al.
To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
CLJul 1, 2025
The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept ErasureYu Fan, Yang Tian, Shauli Ravfogel et al.
Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text's source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate -- often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.
LGApr 7, 2025
Attentional Graph Meta-Learning for Indoor Localization Using Extremely Sparse FingerprintsWenzhong Yan, Feng Yin, Jun Gao et al.
Fingerprint-based indoor localization is often labor-intensive due to the need for dense grids and repeated measurements across time and space. Maintaining high localization accuracy with extremely sparse fingerprints remains a persistent challenge. Existing benchmark methods primarily rely on the measured fingerprints, while neglecting valuable spatial and environmental characteristics. In this paper, we propose a systematic integration of an Attentional Graph Neural Network (AGNN) model, capable of learning spatial adjacency relationships and aggregating information from neighboring fingerprints, and a meta-learning framework that utilizes datasets with similar environmental characteristics to enhance model training. To minimize the labor required for fingerprint collection, we introduce two novel data augmentation strategies: 1) unlabeled fingerprint augmentation using moving platforms, which enables the semi-supervised AGNN model to incorporate information from unlabeled fingerprints, and 2) synthetic labeled fingerprint augmentation through environmental digital twins, which enhances the meta-learning framework through a practical distribution alignment, which can minimize the feature discrepancy between synthetic and real-world fingerprints effectively. By integrating these novel modules, we propose the Attentional Graph Meta-Learning (AGML) model. This novel model combines the strengths of the AGNN model and the meta-learning framework to address the challenges posed by extremely sparse fingerprints. To validate our approach, we collected multiple datasets from both consumer-grade WiFi devices and professional equipment across diverse environments. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that the AGML model-based localization method consistently outperforms all baseline methods using sparse fingerprints across all evaluated metrics.
CVJan 18, 2025
Efficient auto-labeling of large-scale poultry datasets (ALPD) using an ensemble model with self- and active-learning approachesRamesh Bahadur Bist, Lilong Chai, Shawna Weimer et al.
The rapid growth of artificial intelligence in poultry farming has highlighted the challenge of efficiently labeling large, diverse datasets. Manual annotation is time-consuming and costly, making it impractical for modern systems that continuously generate data. This study addresses this challenge by exploring semi-supervised auto-labeling methods, integrating self and active learning approaches to develop an efficient, label-scarce framework for auto-labeling large poultry datasets (ALPD). For this study, video data were collected from broilers and laying hens housed. Various machine learning models, including zero-shot models and supervised models, were utilized for broilers and hens detection. The results showed that YOLOv8s-World and YOLOv9s performed better when compared performance metrics for broiler and hen detection under supervised learning, while among the semi-supervised model, YOLOv8s-ALPD achieved the highest precision (96.1%) and recall (99%) with an RMSE of 1.87. The hybrid YOLO-World model, incorporating the optimal YOLOv8s backbone with zero-shot models, demonstrated the highest overall performance. It achieved a precision of 99.2%, recall of 99.4%, and an F1 score of 98.7% for detection. In addition, the semi-supervised models with minimal human intervention (active learning) reduced annotation time by over 80% compared to full manual labeling. Moreover, integrating zero-shot models with the best models enhanced broiler and hen detection, achieving comparable results to supervised models while significantly increasing speed. In conclusion, integrating semi-supervised auto-labeling and zero-shot models significantly improves detection accuracy. It reduces manual annotation efforts, offering a promising solution to optimize AI-driven systems in poultry farming, advancing precision livestock management, and promoting more sustainable practices.
CVMay 2, 2020
Comparing SNNs and RNNs on Neuromorphic Vision Datasets: Similarities and DifferencesWeihua He, YuJie Wu, Lei Deng et al.
Neuromorphic data, recording frameless spike events, have attracted considerable attention for the spatiotemporal information components and the event-driven processing fashion. Spiking neural networks (SNNs) represent a family of event-driven models with spatiotemporal dynamics for neuromorphic computing, which are widely benchmarked on neuromorphic data. Interestingly, researchers in the machine learning community can argue that recurrent (artificial) neural networks (RNNs) also have the capability to extract spatiotemporal features although they are not event-driven. Thus, the question of "what will happen if we benchmark these two kinds of models together on neuromorphic data" comes out but remains unclear. In this work, we make a systematic study to compare SNNs and RNNs on neuromorphic data, taking the vision datasets as a case study. First, we identify the similarities and differences between SNNs and RNNs (including the vanilla RNNs and LSTM) from the modeling and learning perspectives. To improve comparability and fairness, we unify the supervised learning algorithm based on backpropagation through time (BPTT), the loss function exploiting the outputs at all timesteps, the network structure with stacked fully-connected or convolutional layers, and the hyper-parameters during training. Especially, given the mainstream loss function used in RNNs, we modify it inspired by the rate coding scheme to approach that of SNNs. Furthermore, we tune the temporal resolution of datasets to test model robustness and generalization. At last, a series of contrast experiments are conducted on two types of neuromorphic datasets: DVS-converted (N-MNIST) and DVS-captured (DVS Gesture).