ROMar 31
Efficient Camera Pose Augmentation for View Generalization in Robotic Policy LearningSen Wang, Huaiyi Dong, Jingyi Tian et al.
Prevailing 2D-centric visuomotor policies exhibit a pronounced deficiency in novel view generalization, as their reliance on static observations hinders consistent action mapping across unseen views. In response, we introduce GenSplat, a feed-forward 3D Gaussian Splatting framework that facilitates view-generalized policy learning through novel view rendering. GenSplat employs a permutation-equivariant architecture to reconstruct high-fidelity 3D scenes from sparse, uncalibrated inputs in a single forward pass. To ensure structural integrity, we design a 3D-prior distillation strategy that regularizes the 3DGS optimization, preventing the geometric collapse typical of purely photometric supervision. By rendering diverse synthetic views from these stable 3D representations, we systematically augment the observational manifold during training. This augmentation forces the policy to ground its decisions in underlying 3D structures, thereby ensuring robust execution under severe spatial perturbations where baselines severely degrade.
NIMar 24
Scalable Air-to-Ground Wireless Channel Modeling Using Environmental Context and Generative DiffusionJingyi Tian, Lin Cai
The fast motion of Low Earth Orbit (LEO) satellites causes the propagation channel to vary rapidly, and its behavior is strongly shaped by the surrounding environment, especially at low elevation angles where signals are highly susceptible to terrain blockage and other environmental effects. Existing studies mostly rely on assumed statistical channel distributions and therefore ignore the influence of the actual geographic environment. In this paper, we propose an environment-aware channel modeling method for air-to-ground wireless links. We leverage real environmental data, including digital elevation models (DEMs) and land cover information, together with ray tracing (RT) to determine whether a link is line-of-sight (LOS) or non-line-of-sight (NLOS) and to identify possible reflection paths of the signal. The resulting obstruction and reflection profiles are then combined with models of diffraction loss, vegetation absorption, and atmospheric attenuation to quantitatively characterize channel behavior in realistic geographic environments. Since RT is computationally intensive, we use RT-generated samples and environmental features to train a scalable diffusion model that can efficiently predict channel performance for arbitrary satellite and ground terminal positions, thereby supporting real-time decision-making. In the experiments, we validate the proposed model with measurement data from both cellular and LEO satellite links, demonstrating its effectiveness in realistic environments.
ROJun 19, 2025
FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic ManipulationSen Wang, Le Wang, Sanping Zhou et al.
Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which allows adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we integrate state space models to integrate multimodal information, while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, simplifying the learning process while maintaining performance. We verify the effectiveness of the FlowRAM in the RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in less than 4 time steps, significantly increasing inference speed.
ROOct 28, 2025
DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic ManipulationJingyi Tian, Le Wang, Sanping Zhou et al.
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
CVSep 19, 2025
SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world modelsSen Wang, Jingyi Tian, Le Wang et al.
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.