RO · May 30
SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models
Ziheng He, Yixiang Chen, Ning Yang et al.
Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, the rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts $4.16\times$ faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by $89.0\%$. Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, $π_{0.5}$ success drops only $1.3$ pp in LIBERO simulation and $6.7$ pp on the real robot, whereas fully dense frame-by-frame generation collapses by $48$ to $58$ pp.
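The sparse-to-dense idea in the SKIP abstract can be illustrated with a toy sketch. This is not the authors' method: plain linear interpolation stands in for the learned action-conditioned interpolator, the gap sizes are passed in rather than predicted, and the name `densify` and the list-of-floats frame format are hypothetical.

```python
def densify(keyframes, gaps):
    """Rebuild a dense sequence from sparse keyframes.

    keyframes: list of frames, each a list of floats (toy stand-in for images).
    gaps: gaps[i] is the number of missing frames between keyframes[i] and
          keyframes[i + 1] (assumed given here; SKIP predicts them).
    Linear blending is an illustrative placeholder for the learned interpolator.
    """
    if len(gaps) != len(keyframes) - 1:
        raise ValueError("need exactly one gap size per pair of consecutive keyframes")
    dense = []
    for (start, end), gap in zip(zip(keyframes, keyframes[1:]), gaps):
        dense.append(list(start))
        for k in range(1, gap + 1):
            t = k / (gap + 1)  # fractional position inside the gap
            dense.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    dense.append(list(keyframes[-1]))
    return dense


frames = densify([[0.0], [4.0], [6.0]], gaps=[3, 1])
print(len(frames))  # 3 keyframes + 4 reconstructed frames
```

Only the keyframes would need to come from the expensive generative model; the in-between frames are filled in by a cheaper step, which is where the claimed speedup would come from.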
LG · Jul 14, 2023 · Code
Omnipotent Adversarial Training in the Wild
Guanlin Li, Kangjie Chen, Yuan Xu et al.
Adversarial training is an important topic in robust deep learning, but the community has paid little attention to its practical usage. In this paper, we aim to resolve a real-world challenge, i.e., training a model on an imbalanced and noisy dataset to achieve high clean accuracy and adversarial robustness, with our proposed Omnipotent Adversarial Training (OAT) strategy. OAT consists of two innovative methodologies to address the imperfection in the training set. We first introduce an oracle into the adversarial training process to help the model learn a correct data-label conditional distribution. This carefully designed oracle can provide correct label annotations for adversarial training. We further propose logits adjustment adversarial training to overcome the data imbalance issue, which can help the model learn a Bayes-optimal distribution. Our comprehensive evaluation results show that OAT outperforms other baselines by more than 20% clean accuracy improvement and 10% robust accuracy improvement under complex combinations of data imbalance and label noise scenarios. The code can be found at https://github.com/GuanlinLee/OAT.
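As a rough illustration of the logit-adjustment idea mentioned in the OAT abstract, the sketch below shifts each logit by the log of its class prior before computing cross-entropy. It shows the generic logit-adjusted loss for imbalanced data, not necessarily OAT's exact formulation; the adversarial-example generation and the label oracle are omitted, and the function name and `tau` parameter are assumptions.

```python
import math


def logit_adjusted_cross_entropy(logits, labels, class_counts, tau=1.0):
    """Mean cross-entropy on logits shifted by tau * log(class prior).

    logits: list of per-example logit lists; labels: list of class indices;
    class_counts: training-set frequency of each class (all > 0).
    """
    total = float(sum(class_counts))
    log_prior = [math.log(c / total) for c in class_counts]
    loss = 0.0
    for row, label in zip(logits, labels):
        adjusted = [z + tau * lp for z, lp in zip(row, log_prior)]
        peak = max(adjusted)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
        loss += log_norm - adjusted[label]
    return loss / len(labels)


# With uninformative logits, a rare class (10% of the data) is penalized more.
print(logit_adjusted_cross_entropy([[0.0, 0.0]], [1], class_counts=[90, 10]))
```

With balanced class counts the prior shift is the same for every class and cancels in the softmax, so the loss reduces to ordinary cross-entropy.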
NA · Dec 19, 2007
Discrete Fourier analysis, Cubature and Interpolation on a Hexagon and a Triangle
Huiyuan Li, Jiachang Sun, Yuan Xu
Several problems of trigonometric approximation on a hexagon and a triangle are studied using the discrete Fourier transform and orthogonal polynomials of two variables. A discrete Fourier analysis on the regular hexagon is developed in detail, from which the analysis on the triangle is deduced. The results include cubature formulas and interpolation on these domains. In particular, a trigonometric Lagrange interpolation on a triangle is shown to satisfy an explicit compact formula, which is equivalent to the polynomial interpolation on a planar region bounded by Steiner's hypocycloid. The Lebesgue constant of the interpolation is shown to be in the order of $(\log n)^2$. Furthermore, a Gauss cubature is established on the hypocycloid.
NA · May 22, 2008
New cubature formulae and hyperinterpolation in three variables
Stefano De Marchi, Marco Vianello, Yuan Xu
A new algebraic cubature formula of degree $2n+1$ for the product Chebyshev measure in the $d$-cube with $\approx n^d/2^{d-1}$ nodes is established. The new formula is then applied to polynomial hyperinterpolation of degree $n$ in three variables, in which coefficients of the product Chebyshev orthonormal basis are computed by a fast algorithm based on the 3-dimensional FFT. Moreover, integration of the hyperinterpolant provides a new Clenshaw-Curtis type cubature formula in the 3-cube.
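To put the node count in context, the sketch below implements the classical one-dimensional Gauss-Chebyshev rule and compares the size of its tensor product with the $\approx n^d/2^{d-1}$ estimate quoted in the abstract. The tensor-product rule is used here only as a familiar baseline; it is not the new formula of the paper, and the helper names are hypothetical.

```python
import math


def gauss_chebyshev(f, m):
    """m-point Gauss-Chebyshev rule for the integral of f(x)/sqrt(1-x^2) on [-1, 1].

    Exact for polynomials of degree <= 2m - 1.
    """
    nodes = (math.cos((2 * k - 1) * math.pi / (2 * m)) for k in range(1, m + 1))
    return math.pi / m * sum(f(x) for x in nodes)


def tensor_product_nodes(n, d):
    """Nodes of the tensor-product rule of degree 2n+1: n+1 points per axis."""
    return (n + 1) ** d


def quoted_node_estimate(n, d):
    """Leading-order node count n^d / 2^(d-1) quoted in the abstract."""
    return n ** d / 2 ** (d - 1)


# Three nodes integrate x^4 exactly: the true value is 3*pi/8.
print(gauss_chebyshev(lambda x: x ** 4, 3), 3 * math.pi / 8)
print(tensor_product_nodes(10, 3), quoted_node_estimate(10, 3))
```

For $d = 3$ the quoted estimate is roughly a quarter of the tensor-product count at large $n$, which is the saving the abstract points to.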
NA · Jan 5, 2011
On Gauss-Lobatto integration on the triangle
Yuan Xu
A recent result in [2] on the non-existence of Gauss-Lobatto cubature rules on the triangle is strengthened by establishing a lower bound for the number of nodes of such rules. A method of constructing Lobatto type cubature rules on the triangle is given and used to construct several examples.
NA · Mar 4, 2008
Discrete Fourier analysis on a dodecahedron and a tetrahedron
Huiyuan Li, Yuan Xu
A discrete Fourier analysis on the dodecahedron is studied, from which results on a tetrahedron are deduced by invariance. The results include Fourier analysis in trigonometric functions, interpolation and cubature formulas on these domains. In particular, a trigonometric Lagrange interpolation on the tetrahedron is shown to satisfy an explicit compact formula and the Lebesgue constant of the interpolation is shown to be in the order of $(\log n)^3$.
NA · Oct 3, 2012
Discrete Fourier Analysis and Chebyshev Polynomials with $G_2$ Group
Huiyuan Li, Jiachang Sun, Yuan Xu
The discrete Fourier analysis on the $30^{\circ}$-$60^{\circ}$-$90^{\circ}$ triangle is deduced from the corresponding results on the regular hexagon by considering functions invariant under the group $G_2$, which leads to the definition of four families of generalized Chebyshev polynomials. The study of these polynomials leads to a Sturm-Liouville eigenvalue problem that contains two parameters, whose solutions are analogues of the Jacobi polynomials. Under a concept of $m$-degree and by introducing a new ordering among monomials, these polynomials are shown to share properties of the ordinary orthogonal polynomials. In particular, their common zeros generate cubature rules of Gauss type.
CA · Sep 5, 2008
Discrete Fourier analysis on fundamental domain of $A_d$ lattice and on simplex in $d$-variables
Huiyuan Li, Yuan Xu
A discrete Fourier analysis on the fundamental domain $Ω_d$ of the $d$-dimensional lattice of type $A_d$ is studied, where $Ω_2$ is the regular hexagon and $Ω_3$ is the rhombic dodecahedron, and analogous results on the $d$-dimensional simplex are derived by considering invariant and anti-invariant elements. Our main results include Fourier analysis in trigonometric functions, interpolation and cubature formulas on these domains. In particular, a trigonometric Lagrange interpolation on the simplex is shown to satisfy an explicit compact formula and the Lebesgue constant of the interpolation is shown to be in the order of $(\log n)^d$. The basic trigonometric functions on the simplex can be identified with Chebyshev polynomials in several variables that have already appeared in the literature. We study common zeros of these polynomials and show that they are nodes for a family of Gaussian cubature formulas, which provides only the second known example of such formulas.
NA · Aug 15, 2008
Cubature formula and interpolation on the cubic domain
Huiyuan Li, Jiachang Sun, Yuan Xu
Several cubature formulas on the cubic domains are derived using the discrete Fourier analysis associated with lattice tiling, as developed in [LSX]. The main results consist of a new derivation of the Gaussian type cubature for the product Chebyshev weight functions and associated interpolation polynomials on $[-1,1]^2$, as well as new results on $[-1,1]^3$. In particular, compact formulas for the fundamental interpolation polynomials are derived, based on $n^3/4 + \mathcal{O}(n^2)$ nodes of a cubature formula on $[-1,1]^3$.
NA · Aug 17, 2011
On positive cubature rules on the simplex and isometric embeddings
Masanori Sawa, Yuan Xu
Positive cubature rules of degree 4 and 5 on the $d$-dimensional simplex are constructed and used to construct cubature rules of index 8 or degree 9 on the unit sphere. The latter ones lead to explicit isometric embedding among the classical Banach spaces. Among other things, our results include several explicit representations of $(x_1^2+\cdots+x_d^2)^t$ in terms of linear forms of degree $2t$ with rational coefficients for $t=4$ and $5$.
CV · Aug 22, 2023 · Code
Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
Yifei Su, Dong An, Yuan Xu et al.
This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependency, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity limitations. Our TG-GAT framework won the AVDN Challenge, with 2.2% and 3.0% absolute improvements over the baseline on SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/TG-GAT.
LG · Dec 17, 2025
FrontierCS: Evolving Challenges for Evolving Intelligence
Qiuyang Mang, Wenhao Chai, Zhifei Li et al.
We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.
CV · Feb 2 · Code
One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation
Shuo Lu, Haohan Wang, Wei Feng et al.
Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a "one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present One Size, Many Fits (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves state-of-the-art performance in both offline and online settings. Our code and datasets will be released at https://github.com/JD-GenX/OSMF.
PF · Mar 11 · Code
RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems
Shaobo Li, Yirui Zhou, Yuan Xu et al.
We present the design and implementation of a RAG-based AI system benchmarking (RAGPerf) framework for characterizing the system behaviors of RAG pipelines. To facilitate detailed profiling and fine-grained performance analysis, RAGPerf decouples the RAG workflow into several modular components: embedding, indexing, retrieval, reranking, and generation. RAGPerf offers the flexibility for users to configure the core parameters of each component and examine their impact on the end-to-end query performance and quality. RAGPerf has a workload generator to model real-world scenarios by supporting diverse datasets (e.g., text, pdf, code, and audio), different retrieval and update ratios, and query distributions. RAGPerf also supports different embedding models, major vector databases such as LanceDB, Milvus, Qdrant, Chroma, and Elasticsearch, as well as different LLMs for content generation. It automates the collection of performance metrics (i.e., end-to-end query throughput, host/GPU memory footprint, and CPU/GPU utilization) and accuracy metrics (i.e., context recall, query accuracy, and factual consistency). We demonstrate the capabilities of RAGPerf through a comprehensive set of experiments and open-source its codebase on GitHub. Our evaluation shows that RAGPerf incurs negligible performance overhead.
NA · Jan 14, 2013
Minimal cubature rules on an unbounded domain
Yuan Xu
A family of minimal cubature rules is established on an unbounded domain, which is the first such family known on unbounded domains. The nodes of such cubature rules are common zeros of certain orthogonal polynomials on the unbounded domain, which are also constructed.
RO · Mar 30
EgoDemoGen: Egocentric Demonstration Generation for Viewpoint Generalization in Robotic Manipulation
Yuan Xu, Jiabing Yang, Xiaofeng Wang et al.
Imitation learning based visuomotor policies have achieved strong performance in robotic manipulation, yet they often remain sensitive to egocentric viewpoint shifts. Unlike third-person viewpoint changes that only move the camera, egocentric shifts simultaneously alter both the camera pose and the robot action coordinate frame, making it necessary to jointly transfer action trajectories and synthesize corresponding observations under novel egocentric viewpoints. To address this challenge, we present EgoDemoGen, a framework that generates paired observation-action demonstrations under novel egocentric viewpoints through two key components: 1) EgoTrajTransfer, which transfers robot trajectories to the novel egocentric coordinate frame through motion-skill segmentation, geometry-aware transformation, and inverse kinematics filtering; and 2) EgoViewTransfer, a conditional video generation model that fuses a novel-viewpoint reprojected scene video and a robot motion video rendered from the transferred trajectory to synthesize photorealistic observations, trained with a self-supervised double reprojection strategy without requiring multi-viewpoint data. Experiments in simulation and real-world settings show that EgoDemoGen consistently improves policy success rates under both standard and novel egocentric viewpoints, with absolute gains of +24.6% and +16.9% in simulation and +16.0% and +23.0% on the real robot. Moreover, EgoViewTransfer achieves superior video generation quality for novel egocentric observations.
NA · Nov 4, 2008
OPED reconstruction algorithm for limited angle problem
Yuan Xu, Oleg Tischenko
The structure of the reconstruction algorithm OPED permits a natural way to generate additional data, while still preserving the essential feature of the algorithm. This provides a method for image reconstruction for limited angle problems. Instead of completing the set of data, the set of discrete sine transforms of the data is completed. This is achieved by solving systems of linear equations that have, upon choosing appropriate parameters, positive definite coefficient matrices. Numerical examples are presented.
CV · Apr 21 · Code
EgoSelf: From Memory to Personalized Egocentric Assistant
Yanshuo Wang, Yuan Xu, Xuesong Li et al.
Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from an individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at https://abie-e.github.io/egoself_project/.
CV · Oct 22, 2024 · Code
AlphaChimp: Tracking and Behavior Recognition of Chimpanzees
Xiaoxuan Ma, Yutang Lin, Yuan Xu et al. · pku
Understanding non-human primate behavior is crucial for improving animal welfare, modeling social behavior, and gaining insights into both distinctly human and shared behaviors. Despite recent advances in computer vision, automated analysis of primate behavior remains challenging due to the complexity of their social interactions and the lack of specialized algorithms. Existing methods often struggle with the nuanced behaviors and frequent occlusions characteristic of primate social dynamics. This study aims to develop an effective method for automated detection, tracking, and recognition of chimpanzee behaviors in video footage. Here we show that our proposed method, AlphaChimp, an end-to-end approach that simultaneously detects chimpanzee positions and estimates behavior categories from videos, significantly outperforms existing methods in behavior recognition. AlphaChimp achieves approximately 10% higher tracking accuracy and a 20% improvement in behavior recognition compared to state-of-the-art methods, particularly excelling in the recognition of social behaviors. This superior performance stems from AlphaChimp's innovative architecture, which integrates temporal feature fusion with a Transformer-based self-attention mechanism, enabling more effective capture and interpretation of complex social interactions among chimpanzees. Our approach bridges the gap between computer vision and primatology, enhancing technical capabilities and deepening our understanding of primate communication and sociality. We release our code and models and hope this will facilitate future research in animal social dynamics. This work contributes to ethology, cognitive science, and artificial intelligence, offering new perspectives on social intelligence.
OS · Mar 18
AppFlow: Memory Scheduling for Cold Launch of Large Apps on Mobile and Vehicle Systems
Xiaochen Li, Sicong Liu, Bin Guo et al.
GB-scale large apps like on-device LLMs and rich media editors are becoming the next-generation trend, but their heavy memory and I/O demands, especially during multitasking, cause devices to reclaim or kill processes, turning warm apps into cold launches. The challenge lies not in storing them, but in fast, accurate launching. For users, 1s is the usability cliff, yet our measurements show 86.6% of GB-scale cold launches exceed it. Also, Android Vitals flags only $\geq$ 5s as slow, exposing a large satisfaction gap. Existing optimizations are designed in isolation and conflict. For example, preloading reduces I/O stalls but consumes scarce memory and is undone by reclamation, while reclamation and killing free memory but sacrifice background survivability, leading to repeated cold relaunches. Our key insight is that, although multitasking makes runtime behavior complex, each app's file access pattern remains predictable. The challenge lies in exploiting this predictability, i.e., preloading without exhausting memory, reclaiming without undoing gains, and killing selectively to preserve background survivability. We introduce AppFlow, a prediction-based system-wide scheduler that integrates a Selective File Preloader, an Adaptive Memory Reclaimer, and a Context-Aware Process Killer. Implemented across the Android framework and Linux kernel without app changes, AppFlow cuts GB-scale cold-launch latency by 66.5% (e.g., 2s$\rightarrow$690ms) and sustains 95% of launches within 1s over a 100-day test, significantly improving responsiveness and multitasking experience.
RO · Feb 3
BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks
Yixiang Chen, Peiyan Li, Jiabing Yang et al.
Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io .
CV · Jun 30, 2024 · Code
Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
Weihong Zhong, Xiaocheng Feng, Liang Zhao et al.
Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs' subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs' behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least $31\%$, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than $24\%$ of the snowballed multimodal hallucination while maintaining capabilities.
RO · May 14, 2025 · Code
APR-Transformer: Initial Pose Estimation for Localization in Complex Environments through Absolute Pose Regression
Srinivas Ravuri, Yuan Xu, Martin Ludwig Zehetner et al.
Precise initialization plays a critical role in the performance of localization algorithms, especially in the context of robotics, autonomous driving, and computer vision. Poor localization accuracy is often a consequence of inaccurate initial poses, particularly noticeable in GNSS-denied environments where GPS signals are primarily relied upon for initialization. Recent advances in leveraging deep neural networks for pose regression have led to significant improvements in both accuracy and robustness, especially in estimating complex spatial relationships and orientations. In this paper, we introduce APR-Transformer, a model architecture inspired by state-of-the-art methods, which predicts absolute pose (3D position and 3D orientation) using either image or LiDAR data. We demonstrate that our proposed method achieves state-of-the-art performance on established benchmark datasets such as the Radar Oxford Robot-Car and DeepLoc datasets. Furthermore, we extend our experiments to include our custom complex APR-BeIntelli dataset. Additionally, we validate the reliability of our approach in GNSS-denied environments by deploying the model in real-time on an autonomous test vehicle. This showcases the practical feasibility and effectiveness of our approach. The source code is available at: https://github.com/GT-ARC/APR-Transformer.
RO · May 5
Height Control and Optimal Torque Planning for Jumping With Wheeled-Bipedal Robots
Yulun Zhuang, Yuan Xu, Binxin Huang et al.
This paper studies accurate jump-height control of wheeled-bipedal robots based on torque planning and energy consumption optimization. Because the jumping process is underactuated, involves nonlinear estimation, and includes instantaneous impact, accurate control of the wheeled-bipedal robot's jumping height is complicated. In practice, robots often jump to an excessive height to ensure safety, causing additional motor loss, greater ground reaction force, and more energy consumption. To solve this problem, a novel wheeled-bipedal jumping dynamical model (W-JBD) is proposed to achieve accurate height control. It performs well but is not suitable for a real robot because the torque profile has an abrupt step. Therefore, the Bayesian optimization for torque planning method (BOTP) is proposed, which can obtain the optimal torque plan without an accurate dynamic model and within a few iterations. The BOTP method reduces height error by 82.3% and energy cost by 26.9% with a continuous torque curve. This result is validated in the Webots simulation platform. By using the torque curve obtained from the W-JBD model to narrow the search space, BOTP converges quickly (40 iterations on average). By combining the W-JBD model with the BOTP method, it is possible to achieve height control of real robots with a reasonable number of experiments.
RO · Apr 24
GazeVLA: Learning Human Intention for Robotic Manipulation
Chengyang Li, Kaiyi Xiong, Yuan Xu et al.
Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent embodiment gap between human and robot. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance.
RO · Apr 3
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
Peiyan Li, Yixiang Chen, Yuan Xu et al.
Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image--text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction--based, 3D-based, and vision--language--action models, establishing a new state of the art in data-efficient multi-task manipulation.
CV · Feb 20
UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models
Jiabing Yang, Yixiang Chen, Yuan Xu et al.
Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
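The uncertainty gate described in the UAOR abstract can be sketched as an entropy threshold. This is only an illustration under stated assumptions: the abstract does not define how Action Entropy is computed, so the sketch uses the Shannon entropy of a softmax over action logits, and the function names and the threshold value are hypothetical. The reinjection step itself (attention retrieval into the next layer's FFN) is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_reinject(action_logits, threshold):
    """Gate: trigger observation reinjection only when the layer is uncertain."""
    return entropy(softmax(action_logits)) > threshold


print(should_reinject([0.0, 0.0, 0.0, 0.0], threshold=1.0))   # flat distribution
print(should_reinject([10.0, 0.0, 0.0, 0.0], threshold=1.0))  # peaked distribution
```

Gating on entropy keeps the extra computation off the layers that are already confident, which fits the abstract's claim of minimal overhead.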
RO Apr 10
Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization
Yu Liu, Yihang Yin, Tianlv Huang et al.
Assistive teleoperation enhances efficiency via shared control, yet inter-operator variability, stemming from diverse habits and expertise, induces highly heterogeneous trajectory distributions that undermine intent recognition stability. We present Adaptor, a few-shot framework for robust cross-operator intent recognition. Adaptor bridges the domain gap through two stages: (i) preprocessing, which models intent uncertainty by synthesizing trajectory perturbations via noise injection and performs geometry-aware keyframe extraction; and (ii) policy learning, which encodes the processed trajectories with an Intention Expert and fuses them with the pre-trained vision-language model context to condition an Action Expert for action generation. Experiments on real-world and simulated benchmarks demonstrate that Adaptor achieves state-of-the-art performance, improving success rates and efficiency over baselines. Moreover, the method exhibits low variance across operators with varying expertise, demonstrating robust cross-operator generalization.
RO Oct 22, 2025
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
GigaBrain Team, Angen Ye, Boyuan Wang et al.
Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
LG Mar 6, 2025
CrowdHMTware: A Cross-level Co-adaptation Middleware for Context-aware Mobile DL Deployment
Sicong Liu, Bin Guo, Shiyan Luo et al.
Many deep learning (DL) powered mobile and wearable applications today continuously and unobtrusively sense the ambient surroundings to enhance all aspects of human lives. To enable robust and private mobile sensing, DL models are often deployed locally on resource-constrained mobile devices using techniques such as model compression or offloading. However, existing methods, whether at the front-end algorithm level (i.e., DL model compression/partitioning) or the back-end scheduling level (i.e., operator/resource scheduling), cannot adapt locally online: they either require offline retraining to ensure accuracy or rely on manually pre-defined strategies, and so struggle with dynamic adaptability. The primary challenge lies in feeding back runtime performance from the back-end level to the front-end level optimization decision. Moreover, the adaptive mobile DL model porting middleware with cross-level co-adaptation is less explored, particularly in mobile environments with diversity and dynamics. In response, we introduce CrowdHMTware, a dynamic context-adaptive DL model deployment middleware for heterogeneous mobile devices. It establishes an automated adaptation loop between cross-level functional components, i.e., elastic inference, scalable offloading, and model-adaptive engine, enhancing scalability and adaptability. Experiments with four typical tasks across 15 platforms and a real-world case study demonstrate that CrowdHMTware can effectively scale DL model, offloading, and engine actions across diverse platforms and tasks. It hides run-time system issues from developers, reducing the required developer expertise.
CV Nov 25, 2025
GigaWorld-0: World Models as Data Engine to Empower Embodied AI
GigaWorld Team, Angen Ye, Boyuan Wang et al.
World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.
CV Oct 12, 2025
Seeing My Future: Predicting Situated Interaction Behavior in Virtual Reality
Yuan Xu, Zimu Zhang, Xiaoxuan Ma et al.
Virtual and augmented reality systems increasingly demand intelligent adaptation to user behaviors for enhanced interaction experiences. Achieving this requires accurately understanding human intentions and predicting future situated behaviors - such as gaze direction and object interactions - which is vital for creating responsive VR/AR environments and applications like personalized assistants. However, accurate behavioral prediction demands modeling the underlying cognitive processes that drive human-environment interactions. In this work, we introduce a hierarchical, intention-aware framework that models human intentions and predicts detailed situated behaviors by leveraging cognitive mechanisms. Given historical human dynamics and the observation of scene contexts, our framework first identifies potential interaction targets and then forecasts fine-grained future behaviors. We propose a dynamic Graph Convolutional Network (GCN) to effectively capture human-environment relationships. Extensive experiments on challenging real-world benchmarks and a live VR environment demonstrate the effectiveness of our approach, achieving superior performance across all metrics and enabling practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.
DC Sep 25, 2025
Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters
Lingling Zeng, Gen Zhang, Jialin Peng et al.
As AI cluster sizes continue to expand and the demand for large-language-model (LLM) training and inference workloads grows rapidly, traditional scheduling systems face significant challenges in balancing resource utilization, scheduling efficiency, and service quality. This paper presents and evaluates Kant: an efficient unified scheduling platform designed for large-scale AI container clusters, supporting the co-scheduling of both training and inference jobs. Based on the practical implementation of the Kant system, we systematically define a set of key evaluation metrics for AI clusters, including GPU Allocation Ratio (GAR), Scheduling Occupancy Rate (SOR), GPU Node Fragmentation Ratio (GFR), Job Waiting Time Distribution (JWTD), and Job Training Time Estimation Distribution (JTTED), providing a foundation for quantitative performance analysis. Experimental results demonstrate that Kant achieves exceptional performance in clusters ranging from hundreds to tens of thousands of GPUs. By leveraging scheduling strategies such as Backfill and Enhanced Binpack (E-Binpack), the system significantly improves resource utilization and scheduling efficiency, while effectively reducing resource fragmentation and communication overhead in distributed training. The system has been deployed in multiple AI data center clusters, where it stably supports large-scale intelligent computing workloads. This work provides a practical engineering approach for building high-performance, highly available, AI-native scheduling infrastructure.
CL Aug 6, 2025
DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation
Jiabing Yang, Yixiang Chen, Zichen Wen et al.
Controllable Text Generation (CTG) is a vital subfield in Natural Language Processing (NLP), aiming to generate text that aligns with desired attributes. However, previous studies commonly focus on the quality of controllable text generation for short sequences, while the generation of long-form text remains largely underexplored. In this paper, we observe that the controllability of texts generated by the powerful prefix-based method Air-Decoding tends to decline with increasing sequence length, which we hypothesize primarily arises from the observed decay in attention to the prefixes. Meanwhile, different types of prefixes including soft and hard prefixes are also key factors influencing performance. Building on these insights, we propose a lightweight and effective framework called Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding for controllable text generation. Specifically, it first selects the optimal prefix type for a given task. Then we dynamically amplify the attention to the prefix for the attribute distribution to enhance controllability, with a scaling factor growing exponentially as the sequence length increases. Moreover, based on the task, we optionally apply a similar augmentation to the original prompt for the raw distribution to balance text quality. After attribute distribution reconstruction, the generated text satisfies the attribute constraints well. Experiments on multiple CTG tasks demonstrate that DTPA generally outperforms other methods in attribute control while maintaining competitive fluency, diversity, and topic relevance. Further analysis highlights DTPA's superior effectiveness in long text generation.
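The idea of amplifying prefix attention with a factor that grows exponentially in sequence length can be sketched as follows. This is a minimal illustration under stated assumptions: the factor is taken to be `exp(rate * step)`, it is applied to the post-softmax weights of the prefix positions, and the weights are then renormalised. The function name, `rate`, and the renormalisation step are hypothetical and are not taken from the paper.

```python
import numpy as np

def amplify_prefix_attention(scores, n_prefix, step, rate=0.01):
    """Scale attention on the first n_prefix positions by exp(rate * step).

    scores:   (L,) raw attention scores for one query position
    n_prefix: number of leading positions that belong to the prefix
    step:     current generation step (sequence length so far)
    rate:     hypothetical growth rate of the scaling factor
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # ordinary softmax attention
    factor = np.exp(rate * step)      # grows exponentially with the step
    weights[:n_prefix] *= factor      # amplify attention to the prefix
    return weights / weights.sum()    # renormalise to a distribution
```

With flat scores over ten positions and a two-token prefix, the prefix holds 20% of the attention mass at step 0 and a larger share at later steps, which counteracts the attention decay the abstract describes.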
CL Jul 19, 2025
X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display
Xiaolin Yan, Yangxing Liu, Jiazhang Zheng et al.
Large language models (LLMs) have recently achieved significant advances in reasoning and demonstrated their advantages in solving challenging problems. Yet, their effectiveness in the semiconductor display industry remains limited due to a lack of domain-specific training and expertise. To bridge this gap, we present X-Intelligence 3.0, the first high-performance reasoning model specifically developed for the semiconductor display industry. This model is designed to deliver expert-level understanding and reasoning for the industry's complex challenges. Leveraging a carefully curated industry knowledge base, the model undergoes supervised fine-tuning and reinforcement learning to enhance its reasoning and comprehension capabilities. To further accelerate development, we implemented an automated evaluation framework that simulates expert-level assessments. We also integrated a domain-specific retrieval-augmented generation (RAG) mechanism, resulting in notable performance gains on benchmark datasets. Despite its relatively compact size of 32 billion parameters, X-Intelligence 3.0 outperforms SOTA DeepSeek-R1-671B across multiple evaluations. This demonstrates its exceptional efficiency and establishes it as a powerful solution to the longstanding reasoning challenges faced by the semiconductor display industry.
LG Mar 24, 2025
Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners
Wen Zheng Terence Ng, Jianda Chen, Yuan Xu et al.
This work addresses the challenge of personalizing trajectories generated in automated decision-making systems by introducing a resource-efficient approach that enables rapid adaptation to individual users' preferences. Our method leverages a pretrained conditional diffusion model with Preference Latent Embeddings (PLE), trained on a large, reward-free offline dataset. The PLE serves as a compact representation for capturing specific user preferences. By adapting the pretrained model using our proposed preference inversion method, which directly optimizes the learnable PLE, we achieve superior alignment with human preferences compared to existing solutions like Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA). To better reflect practical applications, we create a benchmark experiment using real human preferences on diverse, high-reward trajectories.
OPTICS Mar 20, 2025
Nano-3D: Metasurface-Based Neural Depth Imaging
Bingxuan Li, Jiahao Wu, Yuan Xu et al.
Depth imaging is a foundational building block for broad applications, such as autonomous driving and virtual/augmented reality. Traditionally, depth cameras have relied on time-of-flight sensors or multi-lens systems to achieve physical depth measurements. However, these systems often face a trade-off between a bulky form factor and imprecise approximations, limiting their suitability for spatially constrained scenarios. Inspired by the emerging advancements of nano-optics, we present Nano-3D, a metasurface-based neural depth imaging solution with an ultra-compact footprint. Nano-3D integrates our custom-fabricated 700 nm thick TiO2 metasurface with a multi-module deep neural network to extract precise metric depth information from monocular metasurface-polarized imagery. We demonstrate the effectiveness of Nano-3D with both simulated and physical experiments. We hope the exhibited success paves the way for the community to bridge future graphics systems with emerging nanomaterial technologies through novel computational approaches.
CR Mar 23, 2021
Risk Analysis and Policy Enforcement of Function Interactions in Robot Apps
Yuan Xu, Tianwei Zhang, Yungang Bao
Robot apps are becoming more automated, complex and diverse. An app usually consists of many functions, interacting with each other and the environment. This allows robots to conduct various tasks. However, it also opens a new door for cyber attacks: adversaries can leverage these interactions to threaten the safety of robot operations. Unfortunately, this issue is rarely explored in past works. We present the first systematic investigation of the function interactions in common robot apps. First, we disclose the potential risks and damages caused by malicious interactions. We introduce a comprehensive graph to model the function interactions in robot apps by analyzing 3,100 packages from the Robot Operating System (ROS) platform. From this graph, we identify and categorize three types of interaction risks. Second, we propose RTron, a novel system to detect and mitigate these risks and protect the operations of robot apps. We introduce security policies for each type of risk, and design coordination nodes to enforce the policies and regulate the interactions. We conduct extensive experiments on 110 robot apps from the ROS platform and two complex apps (Baidu Apollo and Autoware) widely adopted in industry. Evaluation results indicate that RTron can correctly identify and mitigate all potential risks with negligible performance cost. To validate the practicality of the risks and solutions, we implement and evaluate RTron on a physical UGV (Turtlebot) with real-world apps and environments.
RO May 2, 2018
Avalon: Building an Operating System for Robotcenter
Yuan Xu, Zhiyuan Yan, Sa Wang et al.
This paper envisions a scenario in which hundreds of heterogeneous robots form a robotcenter that can be shared by multiple users and used like a single powerful robot to perform complex tasks. However, current multi-robot systems are either unable to manage heterogeneous robots or unable to support multiple concurrent users. Inspired by the design of modern datacenter OSes, we propose Avalon, a robot operating system with a two-level scheduling scheme, which is widely adopted in datacenters for Internet services and cloud computing. Specifically, Avalon integrates three important features: (1) Instead of allocating a whole robot, Avalon classifies fine-grained robot resources into three categories to distinguish which fine-grained resources can be shared by multi-robot frameworks simultaneously. (2) Avalon adopts a location-based resource allocation policy to substantially reduce scheduling overhead. (3) Avalon enables robots to offload computation-intensive tasks to the cloud. We have implemented and evaluated Avalon on robots in both simulated environments and the real world.
NA Feb 12, 2011
Minimal Cubature rules and polynomial interpolation in two variables
Yuan Xu
Minimal cubature rules of degree $4n-1$ for the weight functions $$ W_{\alpha,\beta,\pm \frac12}(x,y) = |x+y|^{2\alpha+1} |x-y|^{2\beta+1} ((1-x^2)(1-y^2))^{\pm \frac12} $$ on $[-1,1]^2$ are constructed explicitly and are shown to be closely related to the Gaussian cubature rules in a domain bounded by two lines and a parabola. Lagrange interpolation polynomials on the nodes of these cubature rules are constructed and their Lebesgue constants are determined.
NA Oct 28, 2009
Discrete Fourier analysis with lattices on planar domains
Huiyuan Li, Jiachang Sun, Yuan Xu
A discrete Fourier analysis associated with translation lattices was recently developed by the authors. It permits two lattices, one determining the integral domain and the other determining the family of exponential functions. Possible choices of lattices are discussed in the case of lattices that tile $\mathbb{R}^2$, and several new results on cubature and interpolation by trigonometric, as well as algebraic, polynomials are obtained.
NA Mar 21, 2007
Fast OPED algorithm for reconstruction of images from Radon data
Yuan Xu, Oleg Tischenko
A fast implementation of the OPED algorithm, a recently introduced reconstruction algorithm for Radon data, is proposed and tested. The new implementation uses the FFT for the discrete sine transform and an interpolation step. The convergence of the fast implementation is proved under the condition that the function is mildly smooth. The numerical test shows that the accuracy of the OPED algorithm changes little when the fast implementation is used.
NA Apr 27, 2006
Bivariate Lagrange interpolation at the Padua points: the ideal theory approach
Len Bos, Stefano De Marchi, Marco Vianello et al.
The Padua points are a family of points on the square $[-1,1]^2$, given by explicit formulas, that admit unique Lagrange interpolation by bivariate polynomials. The interpolation polynomials and cubature formulas based on the Padua points are studied from an ideal theoretic point of view, which leads to the discovery of a compact formula for the interpolation polynomials. The $L^p$ convergence of the interpolation polynomials is also studied.
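The abstract does not reproduce the explicit formulas. As a hedged illustration, the sketch below uses one parametrisation commonly given in the Padua-points literature (an assumption here, not quoted from this abstract): the points are samples of the curve $\gamma_n(t) = (-\cos((n+1)t), -\cos(nt))$ at $t = k\pi/(n(n+1))$, $k = 0, \dots, n(n+1)$. The function name `padua_points` is hypothetical. The count of distinct points should equal $(n+1)(n+2)/2$, the dimension of the space of bivariate polynomials of degree at most $n$, which is what unique interpolation requires.

```python
import math

def padua_points(n):
    """Distinct samples of the assumed generating curve
    gamma(t) = (-cos((n+1)t), -cos(nt)) at t = k*pi/(n(n+1))."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    m = n * (n + 1)
    points = set()
    for k in range(m + 1):
        t = k * math.pi / m
        x = -math.cos((n + 1) * t)
        y = -math.cos(n * t)
        # Round to merge samples where the curve crosses itself;
        # adding 0.0 turns a rounded -0.0 into 0.0 for clean output.
        points.add((round(x, 12) + 0.0, round(y, 12) + 0.0))
    return sorted(points)
```

The curve passes through each of its self-intersection points twice, so the $n(n+1)+1$ samples collapse to $(n+1)(n+2)/2$ distinct points once duplicates are merged.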
NA Mar 9, 2006
Approximation and Reconstruction from Attenuated Radon Projections
Yuan Xu, Oleg Tischenko, Christoph Hoeschen
Attenuated Radon projections with respect to the weight function $W_μ(x,y) = (1-x^2-y^2)^{μ-1/2}$ are shown to be closely related to the orthogonal expansion in two variables with respect to $W_μ$. This leads to an algorithm for reconstructing two dimensional functions (images) from attenuated Radon projections. Similar results are established for reconstructing functions on the sphere from projections described by integrals over circles on the sphere, and for reconstructing functions on the three-dimensional ball and cylinder domains.
CA Oct 15, 2005
A new approach to the reconstruction of images from Radon projections
Yuan Xu
A new approach is proposed for the reconstruction of images from Radon projections. Based on Fourier expansions in orthogonal polynomials of two and three variables, instead of Fourier transforms, the approach provides a new algorithm for computed tomography. The convergence of the algorithm is established under mild assumptions.
NA Mar 16, 2005
Reconstruction of a polynomial from its Radon projections
Borislav Bojanov, Yuan Xu
A polynomial of degree $n$ in two variables is shown to be uniquely determined by its Radon projections taken over $[n/2]+1$ parallel lines in each of the $(2[(n+1)/2]+1)$ equidistant directions along the unit circle.
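A quick count makes the statement plausible. Reading $[\cdot]$ as the integer part (an assumption about the notation), the total number of Radon projections used, $([n/2]+1)(2[(n+1)/2]+1)$, equals $(n+1)(n+2)/2$, the dimension of the space of bivariate polynomials of degree at most $n$. For example, $n = 3$ gives $2 \times 5 = 10$ projections for a 10-dimensional space. The helper names below are hypothetical.

```python
def radon_projection_count(n):
    """Total projections: lines per direction times number of directions,
    with [x] read as the integer part (floor) of x."""
    lines_per_direction = n // 2 + 1
    directions = 2 * ((n + 1) // 2) + 1
    return lines_per_direction * directions

def bivariate_poly_dim(n):
    """Dimension of the space of polynomials of degree <= n in two variables."""
    return (n + 1) * (n + 2) // 2

# The two counts agree for every degree checked here.
for n in range(0, 11):
    assert radon_projection_count(n) == bivariate_poly_dim(n)
```

The agreement holds for all $n$: with $n = 2m$ the count is $(m+1)(2m+1)$, and with $n = 2m+1$ it is $(m+1)(2m+3)$, both equal to $(n+1)(n+2)/2$.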
NA Jul 27, 2004
Polynomial Interpolation on the Unit Sphere II
Wolfgang zu Castell, Noemi Lain Fernandez, Yuan Xu
The problem of interpolation at $(n+1)^2$ points on the unit sphere $\mathbb{S}^2$ by spherical polynomials of degree at most $n$ is proved to have a unique solution for several sets of points. The points are located on a number of circles on the sphere, with an even number of points on each circle. The proof is based on a method of factorization of polynomials.