CV · Oct 5, 2022 · Code
Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection
Jinhyung Park, Chenfeng Xu, Shijia Yang et al. · cmu
While recent camera-only 3D detection methods leverage multiple timesteps, the limited history they use significantly hampers the extent to which temporal fusion can improve object perception. Observing that existing works' fusion of multi-frame images is an instance of temporal stereo matching, we find that performance is hindered by the interplay between 1) the low granularity of matching resolution and 2) the sub-optimal multi-view setup produced by limited history usage. Our theoretical and empirical analysis demonstrates that the optimal temporal difference between views varies significantly for different pixels and depths, making it necessary to fuse many timesteps over long-term history. Building on our investigation, we propose to generate a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. Further, we augment the per-frame monocular depth predictions used for long-term, coarse matching with short-term, fine-grained matching and find that long and short term temporal fusion are highly complementary. While maintaining high efficiency, our framework sets new state-of-the-art on nuScenes, achieving first place on the test set and outperforming previous best art by 5.2% mAP and 3.7% NDS on the validation set. Code will be released at https://github.com/Divadi/SOLOFusion.
CV · Apr 27, 2023 · Code
SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection
Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona et al.
By identifying four important components of existing LiDAR-camera 3D object detection methods (LiDAR and camera candidates, transformation, and fusion outputs), we observe that all existing methods either find dense candidates or yield dense representations of scenes. However, given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations is noisy and inefficient. We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations. Specifically, SparseFusion utilizes the outputs of parallel detectors in the LiDAR and camera modalities as sparse candidates for fusion. We transform the camera candidates into the LiDAR coordinate space by disentangling the object representations. Then, we can fuse the multi-modality candidates in a unified 3D space by a lightweight self-attention module. To mitigate negative transfer between modalities, we propose novel semantic and geometric cross-modality transfer modules that are applied prior to the modality-specific detectors. SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones. We perform extensive experiments to demonstrate the effectiveness and efficiency of our modules and overall method pipeline. Our code will be made publicly available at https://github.com/yichen928/SparseFusion.
CV · Jul 27, 2023 · Code
NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
Chenfeng Xu, Bichen Wu, Ji Hou et al.
We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms state-of-the-art methods by 3.9 mAP and 3.1 mAP on the ScanNet and ARKitScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det is able to generalize well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at https://github.com/facebookresearch/NeRF-Det.
CV · Feb 13, 2023 · Code
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Haoyu Lu, Yuqi Huo, Guoxing Yang et al.
Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive experiments on 6 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases, UniAdapter not only outperforms state-of-the-art methods, but even beats the full fine-tuning strategy. Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/RERV/UniAdapter.
CV · Jul 19, 2022 · Code
What Matters for 3D Scene Flow Network
Guangming Wang, Yunzhe Hu, Zhe Liu et al.
3D scene flow estimation from point clouds is a low-level 3D motion perception task in computer vision. Flow embedding is a commonly used technique in scene flow estimation, and it encodes the point motion between two consecutive frames. Thus, it is critical for the flow embeddings to capture the correct overall direction of the motion. However, previous works only search locally to determine a soft correspondence, ignoring the distant points that turn out to be the actual matching ones. In addition, the estimated correspondence is usually from the forward direction of the adjacent point clouds, and may not be consistent with the estimated correspondence acquired from the backward direction. To tackle these problems, we propose a novel all-to-all flow embedding layer with backward reliability validation during the initial scene flow estimation. Besides, we investigate and compare several design choices in key components of the 3D scene flow network, including the point similarity calculation, input elements of predictor, and predictor & refinement level design. After carefully choosing the most effective designs, we are able to present a model that achieves the state-of-the-art performance on FlyingThings3D and KITTI Scene Flow datasets. Our proposed model surpasses all existing methods by at least 38.2% on FlyingThings3D dataset and 24.7% on KITTI Scene Flow dataset for EPE3D metric. We release our codes at https://github.com/IRMVLab/3DFlow.
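The forward/backward consistency idea mentioned in the abstract above can be illustrated with a small sketch. It uses hard nearest-neighbour matches on toy point clouds, which is an assumption made for brevity: the paper describes soft correspondences inside a learned flow embedding layer, and the function names here are hypothetical.

```python
import math


def nearest(point, cloud):
    """Index of the point in `cloud` closest to `point` (brute-force search)."""
    return min(range(len(cloud)), key=lambda j: math.dist(point, cloud[j]))


def mutually_consistent(cloud_a, cloud_b):
    """For each point i in cloud_a, check whether its forward match in cloud_b
    maps back to i when the search is run in the backward direction."""
    flags = []
    for i, p in enumerate(cloud_a):
        j = nearest(p, cloud_b)                 # forward match: a -> b
        i_back = nearest(cloud_b[j], cloud_a)   # backward match: b -> a
        flags.append(i_back == i)
    return flags


# Toy frames: the third point of frame1 has no true counterpart in frame2.
frame1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
frame2 = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]
flags = mutually_consistent(frame1, frame2)
print(flags)  # [True, True, False]
```

The third point is matched forward to its closest candidate, but the backward search does not return to it, so the match is flagged as unreliable.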
CV · Mar 25, 2023 · Code
Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm
Yichen Xie, Han Lu, Junchi Yan et al.
Given the large-scale data and the high annotation cost, pretraining-finetuning has become a popular paradigm in multiple computer vision tasks. Previous research has covered both the unsupervised pretraining and supervised finetuning in this paradigm, while little attention has been paid to exploiting the annotation budget for finetuning. To fill this gap, we formally define the new active finetuning task, which focuses on the selection of samples for annotation in the pretraining-finetuning paradigm. We propose a novel method called ActiveFT for the active finetuning task. It selects a subset of data that is distributed similarly to the entire unlabeled pool while maintaining enough diversity, by optimizing a parametric model in the continuous space. We prove that the Earth Mover's distance between the distributions of the selected subset and the entire data pool is also reduced in this process. Extensive experiments show that ActiveFT outperforms baselines in both performance and efficiency on image classification and semantic segmentation. Our code is released at https://github.com/yichen928/ActiveFT.
CV · Apr 21, 2022
PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map
Chenfeng Xu, Tian Li, Chen Tang et al. · berkeley
Deep learning has recently achieved significant progress in trajectory forecasting. However, the scarcity of trajectory data inhibits the data-hungry deep-learning models from learning good representations. While mature representation learning methods exist in computer vision and natural language processing, these pre-training methods require large-scale data. It is hard to replicate these approaches in trajectory forecasting due to the lack of adequate trajectory data (e.g., 34K samples in the nuScenes dataset). To work around the scarcity of trajectory data, we resort to another data modality closely related to trajectories: HD maps, which are abundantly provided in existing datasets. In this paper, we propose PreTraM, a self-supervised pre-training scheme via connecting trajectories and maps for trajectory forecasting. Specifically, PreTraM consists of two parts: 1) Trajectory-Map Contrastive Learning, where we project trajectories and maps to a shared embedding space with cross-modal contrastive learning, and 2) Map Contrastive Learning, where we enhance map representation with contrastive learning on large quantities of HD maps. On top of popular baselines such as AgentFormer and Trajectron++, PreTraM boosts their performance by 5.5% and 6.9% relatively in FDE-10 on the challenging nuScenes dataset. We show that PreTraM improves data efficiency and scales well with model size.
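As a rough illustration of cross-modal contrastive learning between trajectory and map embeddings, the sketch below computes a symmetric InfoNCE loss over a batch of paired embeddings. The symmetric form, the cosine similarity, the temperature value, and all function names are assumptions chosen for illustration; the abstract does not specify PreTraM's exact objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Softmax cross-entropy for one row, using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def symmetric_info_nce(traj_emb, map_emb, temperature=0.1):
    """Pair i of (trajectory, map) is the positive; all other pairs in the
    batch act as negatives. The loss averages both matching directions."""
    n = len(traj_emb)
    sim = [[cosine(t, m) / temperature for m in map_emb] for t in traj_emb]
    traj_to_map = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    map_to_traj = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (traj_to_map + map_to_traj)


# Hypothetical 2-D embeddings: correctly paired vs. deliberately swapped.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True: matched pairs give the lower loss
```

Minimizing such a loss pulls each trajectory embedding toward its own map embedding and pushes it away from the maps of other samples in the batch.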
CV · Mar 17, 2022 · Code
DetMatch: Two Teachers are Better Than One for Joint 2D and 3D Semi-Supervised Object Detection
Jinhyung Park, Chenfeng Xu, Yiyang Zhou et al.
While numerous 3D detection works leverage the complementary relationship between RGB images and point clouds, developments in the broader framework of semi-supervised object recognition remain uninfluenced by multi-modal fusion. Current methods develop independent pipelines for 2D and 3D semi-supervised learning despite the availability of paired image and point cloud frames. Observing that the distinct characteristics of each sensor cause them to be biased towards detecting different objects, we propose DetMatch, a flexible framework for joint semi-supervised learning on 2D and 3D modalities. By identifying objects detected in both sensors, our pipeline generates a cleaner, more robust set of pseudo-labels that both demonstrates stronger performance and stymies single-modality error propagation. Further, we leverage the richer semantics of RGB images to rectify incorrect 3D class predictions and improve localization of 3D boxes. Evaluating on the challenging KITTI and Waymo datasets, we improve upon strong semi-supervised learning methods and observe higher quality pseudo-labels. Code will be released at https://github.com/Divadi/DetMatch
CV · Sep 18, 2023 · Code
Pre-training on Synthetic Driving Data for Trajectory Prediction
Yiheng Li, Seth Z. Zhao, Chenfeng Xu et al.
Accumulating substantial volumes of real-world driving data proves pivotal in the realm of trajectory forecasting for autonomous driving. Given the heavy reliance of current trajectory forecasting models on data-driven methodologies, we aim to tackle the challenge of learning general trajectory forecasting representations under limited data availability. We propose a pipeline-level solution to mitigate the issue of data scarcity in trajectory forecasting. The solution is composed of two parts: firstly, we adopt HD map augmentation and trajectory synthesis for generating driving data, and then we learn representations by pre-training on them. Specifically, we apply vector transformations to reshape the maps, and then employ a rule-based model to generate trajectories on both original and augmented scenes; thus enlarging the driving data without collecting additional real ones. To foster the learning of general representations within this augmented dataset, we comprehensively explore the different pre-training strategies, including extending the concept of a Masked AutoEncoder (MAE) for trajectory forecasting. Without bells and whistles, our proposed pipeline-level solution is general, simple, yet effective: we conduct extensive experiments to demonstrate the effectiveness of our data expansion and pre-training strategies, which outperform the baseline prediction model by large margins, e.g. 5.04%, 3.84% and 8.30% in terms of $MR_6$, $minADE_6$ and $minFDE_6$. The pre-training dataset and the codes for pre-training and fine-tuning are released at https://github.com/yhli123/Pretraining_on_Synthetic_Driving_Data_for_Trajectory_Prediction.
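The metrics quoted above ($MR_6$, $minADE_6$, $minFDE_6$) follow common trajectory-forecasting conventions in which a model proposes K = 6 candidate futures and the best one is scored. The sketch below shows those conventions on toy data; the 2.0 m miss threshold is an assumed default, the function names are hypothetical, and the benchmark used in the paper may define the details slightly differently.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean point-wise distance over the horizon."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """Final displacement error: distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])


def min_ade(candidates, gt):
    """Best ADE among the K candidate trajectories."""
    return min(ade(c, gt) for c in candidates)


def min_fde(candidates, gt):
    """Best FDE among the K candidate trajectories."""
    return min(fde(c, gt) for c in candidates)


def miss_rate(scenes, threshold=2.0):
    """Fraction of scenes whose best final point is farther than `threshold`
    (assumed to be metres) from the ground-truth endpoint."""
    misses = sum(1 for cands, gt in scenes if min_fde(cands, gt) > threshold)
    return misses / len(scenes)


# Toy scene with two candidates (a real evaluation would use K = 6).
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)],
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
]
best_ade = min_ade(candidates, gt)   # (0 + 0 + 0.5) / 3
best_fde = min_fde(candidates, gt)   # 0.5
mr = miss_rate([(candidates, gt)])   # 0.0, since 0.5 <= 2.0
```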
CV · Aug 8, 2023
DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds
Chensheng Peng, Guangming Wang, Xian Wan Lo et al. · berkeley
Point clouds are naturally sparse, while image pixels are dense. This inconsistency limits feature fusion from both modalities for point-wise scene flow estimation. Previous methods rarely predict scene flow from the entire point clouds of the scene with one-time inference due to the memory inefficiency and heavy overhead from distance calculation and sorting involved in commonly used farthest point sampling, KNN, and ball query algorithms for local feature aggregation. To mitigate these issues in scene flow learning, we regularize raw points to a dense format by storing 3D coordinates in 2D grids. Unlike the sampling operation commonly used in existing works, the dense 2D representation 1) preserves most points in the given scene, 2) brings in a significant boost of efficiency, and 3) eliminates the density gap between points and pixels, allowing us to perform effective feature fusion. We also present a novel warping projection technique to alleviate the information loss problem resulting from the fact that multiple points could be mapped into one grid cell during projection when computing the cost volume. Extensive experiments demonstrate the efficiency and effectiveness of our method, which outperforms prior art on the FlyingThings3D and KITTI datasets.
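The idea of storing 3D coordinates in a 2D grid can be sketched with a plain pinhole projection. The intrinsics (`fx`, `fy`, `cx`, `cy`), the grid size, and the keep-the-nearest-point rule for collisions are all assumptions for illustration; in particular, the collision rule here is a simple stand-in and not the warping projection technique proposed in the paper.

```python
def project_to_grid(points, fx, fy, cx, cy, height, width):
    """Place each 3D point (x, y, z) into the grid cell it projects to.

    Each cell stores the full 3D coordinate, so the grid is a dense 2D layout
    of the point cloud. Returns the grid and the number of points that were
    lost (behind the camera, outside the grid, or displaced by a collision).
    """
    grid = [[None] * width for _ in range(height)]
    lost = 0
    for x, y, z in points:
        if z <= 0:                      # behind the camera plane
            lost += 1
            continue
        u = int(round(fx * x / z + cx))  # column index
        v = int(round(fy * y / z + cy))  # row index
        if not (0 <= u < width and 0 <= v < height):
            lost += 1
            continue
        current = grid[v][u]
        if current is not None:
            lost += 1                   # two points share a cell: one is lost
            if z >= current[2]:
                continue                # keep the nearer point already stored
        grid[v][u] = (x, y, z)
    return grid, lost


points = [
    (0.0, 0.0, 1.0),   # projects to cell (row 1, col 1)
    (0.0, 0.0, 2.0),   # same cell, farther away: collision
    (1.0, 1.0, 1.0),   # projects to cell (row 2, col 2)
    (5.0, 0.0, 1.0),   # falls outside the 3x3 grid
]
grid, lost = project_to_grid(points, fx=1.0, fy=1.0, cx=1.0, cy=1.0,
                             height=3, width=3)
print(lost)  # 2
```

The collision case is exactly the information loss that the abstract says the warping projection is designed to alleviate.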
RO · Mar 28, 2022
Domain Knowledge Driven Pseudo Labels for Interpretable Goal-Conditioned Interactive Trajectory Prediction
Lingfeng Sun, Chen Tang, Yaru Niu et al. · berkeley
Motion forecasting in highly interactive scenarios is a challenging problem in autonomous driving. In such scenarios, we need to accurately predict the joint behavior of interacting agents to ensure the safe and efficient navigation of autonomous vehicles. Recently, goal-conditioned methods have gained increasing attention due to their advantage in performance and their ability to capture the multimodality in trajectory distribution. In this work, we study the joint trajectory prediction problem with the goal-conditioned framework. In particular, we introduce a conditional-variational-autoencoder-based (CVAE) model to explicitly encode different interaction modes into the latent space. However, we discover that the vanilla model suffers from posterior collapse and cannot induce an informative latent space as desired. To address these issues, we propose a novel approach to avoid KL vanishing and induce an interpretable interactive latent space with pseudo labels. The proposed pseudo labels allow us to incorporate domain knowledge on interaction in a flexible manner. We motivate the proposed method using an illustrative toy example. In addition, we validate our framework on the Waymo Open Motion Dataset with both quantitative and qualitative evaluations.
CV · Aug 1, 2024
Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation
Yixiao Wang, Chen Tang, Lingfeng Sun et al. · berkeley
Diffusion models are promising for joint trajectory prediction and controllable generation in autonomous driving, but they face challenges of inefficient inference steps and high computational demands. To tackle these challenges, we introduce Optimal Gaussian Diffusion (OGD) and Estimated Clean Manifold (ECM) Guidance. OGD optimizes the prior distribution for a small diffusion time $T$ and starts the reverse diffusion process from it. ECM directly injects guidance gradients to the estimated clean manifold, eliminating extensive gradient backpropagation throughout the network. Our methodology streamlines the generative process, enabling practical applications with reduced computational overhead. Experimental validation on the large-scale Argoverse 2 dataset demonstrates our approach's superior performance, offering a viable solution for computationally efficient, high-quality joint trajectory prediction and controllable generation for autonomous driving. Our project webpage is at https://yixiaowang7.github.io/OptTrajDiff_Page/.
RO · Oct 4, 2023
Human-oriented Representation Learning for Robotic Manipulation
Mingxiao Huo, Mingyu Ding, Chenfeng Xu et al. · berkeley
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited for learning robot manipulation policies compared to current state-of-the-art visual representations purely based on self-supervised objectives. We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce Task Fusion Decoder as a plug-and-play embedding translator that utilizes the underlying relationships among these perceptual skills to guide the representation learning towards encoding meaningful structure for what's important for all perceptual skills, ultimately empowering learning of downstream robotic manipulation tasks. Extensive experiments across a range of robotic tasks and embodiments, in both simulations and real-world environments, show that our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders including R3M, MVP, and EgoVLP, for downstream manipulation policy-learning. Project page: https://sites.google.com/view/human-oriented-robot-learning
CV · Sep 29, 2023 · Code
Towards Free Data Selection with General-Purpose Models
Yichen Xie, Mingyu Ding, Masayoshi Tomizuka et al.
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. However, current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. In this paper, we challenge this status quo by designing a distinct data selection pipeline that utilizes existing general-purpose models to select data from various datasets with a single-pass inference without the need for additional training or supervision. A novel free data selection (FreeSel) method is proposed following this new pipeline. Specifically, we define semantic patterns extracted from intermediate features of the general-purpose model to capture subtle local information in each image. We then enable the selection of all data samples in a single pass through distance-based sampling at the fine-grained semantic pattern level. FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods. Extensive experiments verify the effectiveness of FreeSel on various computer vision tasks. Our code is available at https://github.com/yichen928/FreeSel.
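One simple instance of distance-based sampling is greedy farthest-point selection over feature vectors, sketched below. This is an assumption made for illustration: the abstract does not say which distance-based rule FreeSel uses, and the feature vectors and function name are hypothetical.

```python
import math


def farthest_point_selection(features, budget, first=0):
    """Greedily pick `budget` indices so that each new pick is the sample
    farthest from everything selected so far (single pass, no training)."""
    if not 0 < budget <= len(features):
        raise ValueError("budget must be between 1 and len(features)")
    selected = [first]
    # Distance from every sample to its closest selected sample.
    min_dist = [math.dist(f, features[first]) for f in features]
    while len(selected) < budget:
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected


# Toy 2-D features forming three loose groups.
feats = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
picked = farthest_point_selection(feats, budget=3)
print(picked)  # [0, 4, 3]
```

With a budget of three, the selection covers all three groups instead of spending labels on near-duplicates.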
CV · Sep 1, 2024 · Code
DSLO: Deep Sequence LiDAR Odometry Based on Inconsistent Spatio-temporal Propagation
Huixin Zhang, Guangming Wang, Xinrui Wu et al.
This paper introduces a 3D point cloud sequence learning model based on inconsistent spatio-temporal propagation for LiDAR odometry, termed DSLO. It consists of a pyramid structure with a spatial information reuse strategy, a sequential pose initialization module, a gated hierarchical pose refinement module, and a temporal feature propagation module. First, spatial features are encoded using a point feature pyramid, with features reused in successive pose estimations to reduce computational overhead. Second, a sequential pose initialization method is introduced, leveraging the high-frequency sampling characteristic of LiDAR to initialize the LiDAR pose. Then, a gated hierarchical pose refinement mechanism refines poses from coarse to fine by selectively retaining or discarding motion information from different layers based on gate estimations. Finally, temporal feature propagation is proposed to incorporate the historical motion information from point cloud sequences, and address the spatial inconsistency issue when transmitting motion information embedded in point clouds between frames. Experimental results on the KITTI odometry dataset and Argoverse dataset demonstrate that DSLO outperforms state-of-the-art methods, achieving at least a 15.67% improvement on RTE and a 12.64% improvement on RRE, while also achieving a 34.69% reduction in runtime compared to baseline methods. Our implementation will be available at https://github.com/IRMVLab/DSLO.
RO · Oct 4, 2023
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Hao Sha, Yao Mu, Yuxuan Jiang et al.
Existing learning-based autonomous driving (AD) systems face challenges in comprehending high-level information, generalizing to rare events, and providing interpretability. To address these problems, this work employs Large Language Models (LLMs) as a decision-making component for complex AD scenarios that require human commonsense understanding. We devise cognitive pathways to enable comprehensive reasoning with LLMs, and develop algorithms for translating LLM decisions into actionable driving commands. Through this approach, LLM decisions are seamlessly integrated with low-level controllers by guided parameter matrix adaptation. Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors and even multi-vehicle coordination, thanks to the commonsense reasoning capabilities of LLMs. This paper presents an initial step toward leveraging LLMs as effective decision-makers for intricate AD scenarios in terms of safety, efficiency, generalizability, and interoperability. We aspire for it to serve as inspiration for future research in this field. Project page: https://sites.google.com/view/llm-mpc
88.6 · CV · May 25 · Code
Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution
Tianshuo Xu, Yichen Xie, Depu Meng et al.
Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.
RO · Mar 24, 2023
Editing Driver Character: Socially-Controllable Behavior Generation for Interactive Traffic Simulation
Wei-Jer Chang, Chen Tang, Chenran Li et al.
Traffic simulation plays a crucial role in evaluating and improving autonomous driving planning systems. After being deployed on public roads, autonomous vehicles need to interact with human road participants with different social preferences (e.g., selfish or courteous human drivers). To ensure that autonomous vehicles take safe and efficient maneuvers in different interactive traffic scenarios, we should be able to evaluate autonomous vehicles against reactive agents with different social characteristics in the simulation environment. We propose a socially-controllable behavior generation (SCBG) model for this purpose, which allows the users to specify the level of courtesy of the generated trajectory while ensuring realistic and human-like trajectory generation through learning from real-world driving data. Specifically, we define a novel and differentiable measure to quantify the level of courtesy of driving behavior, leveraging marginal and conditional behavior prediction models trained from real-world driving data. The proposed courtesy measure allows us to auto-label the courtesy levels of trajectories from real-world driving data and conveniently train an SCBG model generating trajectories based on the input courtesy values. We examined the SCBG model on the Waymo Open Motion Dataset (WOMD) and showed that we were able to control the SCBG model to generate realistic driving behaviors with desired courtesy levels. Interestingly, we found that the SCBG model was able to identify different motion patterns of courteous behaviors according to the scenarios.
CV · Feb 24
RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space
Yichen Xie, Chensheng Peng, Mazen Abdelfattah et al. · berkeley
World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agnostic multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at https://raynova-ai.github.io/.
RO · Apr 19, 2022
Interventional Behavior Prediction: Avoiding Overly Confident Anticipation in Interactive Prediction
Chen Tang, Wei Zhan, Masayoshi Tomizuka
Conditional behavior prediction (CBP) builds up the foundation for a coherent interactive prediction and planning framework that can enable more efficient and less conservative maneuvers in interactive scenarios. In the CBP task, we train a prediction model approximating the posterior distribution of target agents' future trajectories conditioned on the future trajectory of an assigned ego agent. However, we argue that CBP may provide overly confident anticipation of how the autonomous agent may influence the target agents' behavior. Consequently, it is risky for the planner to query a CBP model. Instead, we should treat the planned trajectory as an intervention and let the model learn the trajectory distribution under intervention. We refer to it as the interventional behavior prediction (IBP) task. Moreover, to properly evaluate an IBP model with offline datasets, we propose a Shapley-value-based metric to verify if the prediction model satisfies the inherent temporal independence of an interventional distribution. We show that the proposed metric can effectively identify a CBP model violating the temporal independence, which plays an important role when establishing IBP benchmarks.
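For readers unfamiliar with the Shapley value that the proposed metric builds on, the sketch below computes exact Shapley values for a small cooperative game by averaging marginal contributions over all player orderings. It shows only the generic definition; how the paper maps players and the value function onto a prediction model is not described in the abstract, and the toy game here is hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the players can join the coalition.
    `value` maps a frozenset of players to a number."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


# Toy game: a payoff of 10 is earned only when both "a" and "b" take part;
# "c" never changes the payoff.
def payoff(coalition):
    return 10.0 if {"a", "b"} <= coalition else 0.0


result = shapley_values(["a", "b", "c"], payoff)
print(result)  # {'a': 5.0, 'b': 5.0, 'c': 0.0}
```

A player with no influence on the outcome receives a value of zero, which is the property that makes Shapley values a natural tool for testing whether one input affects a model's output. The enumeration is factorial in the number of players, so it is practical only for very small games.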
RO · Sep 18, 2023
Guided Online Distillation: Promoting Safe Reinforcement Learning by Offline Demonstration
Jinning Li, Xinyi Liu, Banghua Zhu et al.
Safe Reinforcement Learning (RL) aims to find a policy that achieves high rewards while satisfying cost constraints. When learning from scratch, safe RL agents tend to be overly conservative, which impedes exploration and restrains the overall performance. In many realistic tasks, e.g. autonomous driving, large-scale expert demonstration data are available. We argue that extracting expert policy from offline data to guide online exploration is a promising solution to mitigate the conservativeness issue. Large-capacity models, e.g. decision transformers (DT), have been proven to be competent in offline policy learning. However, data collected in real-world scenarios rarely contain dangerous cases (e.g., collisions), which makes it prohibitive for the policies to learn safety concepts. Besides, these bulky policy networks cannot meet the computation speed requirements at inference time on real-world tasks such as autonomous driving. To this end, we propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework. GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms. Experiments in both benchmark safe RL tasks and real-world driving tasks based on the Waymo Open Motion Dataset (WOMD) demonstrate that GOLD can successfully distill lightweight policies and solve decision-making problems in challenging safety-critical scenarios.
CV · Apr 27, 2023
Quadric Representations for LiDAR Odometry, Mapping and Localization
Chao Xia, Chenfeng Xu, Patrick Rim et al.
Current LiDAR odometry, mapping and localization methods leverage point-wise representations of 3D scenes and achieve high accuracy in autonomous driving tasks. However, the space-inefficiency of methods that use point-wise representations limits their development and usage in practical applications. In particular, scan-submap matching and global map representation methods are restricted by the inefficiency of nearest neighbor searching (NNS) for large-volume point clouds. To improve space-time efficiency, we propose a novel method of describing scenes using quadric surfaces, which are far more compact representations of 3D objects than conventional point clouds. In contrast to point cloud-based methods, our quadric representation-based method decomposes a 3D scene into a collection of sparse quadric patches, which improves storage efficiency and avoids the slow point-wise NNS process. Our method first segments a given point cloud into patches and fits each of them to a quadric implicit function. Each function is then coupled with other geometric descriptors of the patch, such as its center position and covariance matrix. Collectively, these patch representations fully describe a 3D scene, which can be used in place of the original point cloud and employed in LiDAR odometry, mapping and localization algorithms. We further design a novel incremental growing method for quadric representations, which eliminates the need to repeatedly re-fit quadric surfaces from the original point cloud. Extensive odometry, mapping and localization experiments on large-volume point clouds in the KITTI and UrbanLoco datasets demonstrate that our method maintains low latency and memory utility while achieving competitive, and even superior, accuracy.
CVSep 26, 2022
Center Feature Fusion: Selective Multi-Sensor Fusion of Center-based ObjectsPhilip Jacobson, Yiyang Zhou, Wei Zhan et al.
Leveraging multi-modal fusion, especially between camera and LiDAR, has become essential for building accurate and robust 3D object detection systems for autonomous vehicles. Until recently, point decorating approaches, in which point clouds are augmented with camera features, have been the dominant approach in the field. However, these approaches fail to utilize the higher resolution images from cameras. Recent works projecting camera features to the bird's-eye-view (BEV) space for fusion have also been proposed; however, they require projecting millions of pixels, most of which only contain background information. In this work, we propose a novel approach, Center Feature Fusion (CFF), in which we leverage center-based detection networks in both the camera and LiDAR streams to identify relevant object locations. We then use the center-based detection to identify the locations of pixel features relevant to object locations, a small fraction of the total number in the image. These are then projected and fused in the BEV frame. On the nuScenes dataset, we outperform the LiDAR-only baseline by 4.9% mAP while fusing up to 100x fewer features than other fusion methods.
CVJul 8, 2022
SST-Calib: Simultaneous Spatial-Temporal Parameter Calibration between LIDAR and CameraAkio Kodaira, Yiyang Zhou, Pengwei Zang et al.
With information from multiple input modalities, sensor fusion-based algorithms usually out-perform their single-modality counterparts in robotics. Camera and LIDAR, with complementary semantic and depth information, are the typical choices for detection tasks in complicated driving environments. For most camera-LIDAR fusion algorithms, however, the calibration of the sensor suite will greatly impact the performance. More specifically, the detection algorithm usually requires an accurate geometric relationship among multiple sensors as the input, and it is often assumed that the contents from these sensors are captured at the same time. Preparing such sensor suites involves carefully designed calibration rigs and accurate synchronization mechanisms, and the preparation process is usually done offline. In this work, a segmentation-based framework is proposed to jointly estimate the geometrical and temporal parameters in the calibration of a camera-LIDAR suite. A semantic segmentation mask is first applied to both sensor modalities, and the calibration parameters are optimized through pixel-wise bidirectional loss. We specifically incorporated the velocity information from optical flow for temporal parameters. Since supervision is only performed at the segmentation level, no calibration label is needed within the framework. The proposed algorithm is tested on the KITTI dataset, and the result shows an accurate real-time calibration of both geometric and temporal parameters.
LGJun 14, 2023
Skill-Critic: Refining Learned Skills for Hierarchical Reinforcement LearningCe Hao, Catherine Weaver, Chen Tang et al.
Hierarchical reinforcement learning (RL) can accelerate long-horizon decision-making by temporally abstracting a policy into multiple levels. Promising results in sparse reward environments have been seen with skills, i.e. sequences of primitive actions. Typically, a skill latent space and policy are discovered from offline data. However, the resulting low-level policy can be unreliable due to low-coverage demonstrations or distribution shifts. As a solution, we propose the Skill-Critic algorithm to fine-tune the low-level policy in conjunction with high-level skill selection. Our Skill-Critic algorithm optimizes both the low-level and high-level policies; these policies are initialized and regularized by the latent space learned from offline demonstrations to guide the parallel policy optimization. We validate Skill-Critic in multiple sparse-reward RL environments, including a new sparse-reward autonomous racing task in Gran Turismo Sport. The experiments show that Skill-Critic's low-level policy fine-tuning and demonstration-guided regularization are essential for good performance. Code and videos are available at our website: https://sites.google.com/view/skill-critic.
LGJun 15, 2023
Residual Q-Learning: Offline and Online Policy Customization without ValueChenran Li, Chen Tang, Haruki Nishimura et al.
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations. It is especially appealing for solving complex real-world tasks where handcrafting reward function is difficult, or when the goal is to mimic human expert behavior. However, the learned imitative policy can only follow the behavior in the demonstration. When applying the imitative policy, we may need to customize the policy behavior to meet different requirements coming from diverse downstream tasks. Meanwhile, we still want the customized policy to maintain its imitative nature. To this end, we formulate a new problem setting called policy customization. It defines the learning task as training a policy that inherits the characteristics of the prior policy while satisfying some additional requirements imposed by a target downstream task. We propose a novel and principled approach to interpret and determine the trade-off between the two task objectives. Specifically, we formulate the customization problem as a Markov Decision Process (MDP) with a reward function that combines 1) the inherent reward of the demonstration; and 2) the add-on reward specified by the downstream task. We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy without knowing the inherent reward or value function of the prior policy. We derive a family of residual Q-learning algorithms that can realize offline and online policy customization, and show that the proposed algorithms can effectively accomplish policy customization tasks in various environments. Demo videos and code are available on our website: https://sites.google.com/view/residualq-learning.
CVJun 29, 2023
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous TrainingZitian Chen, Mingyu Ding, Yikang Shen et al.
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image set with multiple task labels. Such multi-label data sets are rare, small, and expensive. We say heterogeneous to refer to image sets with different task labels, or to combinations of single-task datasets. Few have explored training on such heterogeneous datasets. General-purpose vision models are still dominated by single-task pretraining, and it remains unclear how to scale up multi-task models by leveraging mainstream vision datasets designed for different purposes. The challenges lie in managing large intrinsic differences among vision tasks, including data distribution, architectures, task-specific modules, dataset scales, and sampling strategies. To address these challenges, we propose to modify and scale up mixture-of-experts (MoE) vision transformers, so that they can simultaneously learn classification, detection, and segmentation on diverse mainstream vision datasets including ImageNet, COCO, and ADE20K. Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks. Due to its emergent modularity, this general-purpose model decomposes into high-performing components, efficiently adapting to downstream tasks. We can fine-tune it with fewer training parameters, fewer model parameters, and less computation. Additionally, its modularity allows for easy expansion in continual-learning-without-forgetting scenarios. Finally, these functions can be controlled and combined to meet various demands of downstream tasks.
LGAug 6, 2022
Generalizability Analysis of Graph-based Trajectory Predictor with Vectorized RepresentationJuanwu Lu, Wei Zhan, Masayoshi Tomizuka et al.
Trajectory prediction is one of the essential tasks for autonomous vehicles. Recent progress in machine learning gave birth to a series of advanced trajectory prediction algorithms. Lately, the effectiveness of using graph neural networks (GNNs) with vectorized representations for trajectory prediction has been demonstrated by many researchers. Nonetheless, these algorithms either pay little attention to models' generalizability across various scenarios or simply assume training and test data follow similar statistics. In fact, when test scenarios are unseen or Out-of-Distribution (OOD), the resulting train-test domain shift usually leads to significant degradation in prediction performance, which will impact downstream modules and eventually lead to severe accidents. Therefore, it is of great importance to thoroughly investigate the prediction models in terms of their generalizability, which can not only help identify their weaknesses but also provide insights on how to improve these models. This paper proposes a generalizability analysis framework using feature attribution methods to help interpret black-box models. For the case study, we provide an in-depth generalizability analysis of one of the state-of-the-art graph-based trajectory predictors that utilize vectorized representation. Results show significant performance degradation due to domain shift, and feature attribution provides insights to identify potential causes of these problems. Finally, we summarize the common prediction challenges and show how weighting biases induced by the training process can degrade accuracy.
ROJul 1, 2024
Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot LearningYixiao Wang, Yifei Zhang, Mingxiao Huo et al.
The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulation and the real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and code can be found at https://forrest-110.github.io/sparse_diffusion_policy/.
98.9CVApr 9
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End DrivingHao Shao, Letian Wang, Yang Zhou et al. · tsinghua
Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.
ROAug 9, 2022
Analyzing and Enhancing Closed-loop Stability in Reactive SimulationWei-Jer Chang, Yeping Hu, Chenran Li et al.
Simulation has played an important role in efficiently evaluating self-driving vehicles in terms of scalability. Existing methods mostly rely on heuristic-based simulation, where traffic participants follow certain human-encoded rules that fail to generate complex human behaviors. Therefore, the reactive simulation concept is proposed to bridge the human behavior gap between simulation and real-world traffic scenarios by leveraging real-world data. However, these reactive models can easily generate unreasonable behaviors after a few steps of simulation, where we regard the model as losing its stability. To the best of our knowledge, no work has explicitly discussed and analyzed the stability of the reactive simulation framework. In this paper, we aim to provide a thorough stability analysis of the reactive simulation and propose a solution to enhance the stability. Specifically, we first propose a new reactive simulation framework, where we discover that the smoothness and consistency of the simulated state sequences are crucial factors to stability. We then incorporate the kinematic vehicle model into the framework to improve the closed-loop stability of the reactive simulation. Furthermore, along with commonly-used metrics, several novel metrics are proposed in this paper to better analyze the simulation performance.
91.5CVMar 24
UniQueR: Unified Query-based Feedforward 3D ReconstructionChensheng Peng, Quentin Herau, Jiezhi Yang et al. · berkeley
We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric queries, enabling the network to infer scene structure, including geometry in occluded regions, in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space (instead of per-frame camera space) and spawns a set of 3D Gaussians for differentiable rendering. By leveraging unified query interactions across multi-view features and a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF demonstrate that UniQueR surpasses state-of-the-art feedforward methods in both rendering quality and geometric accuracy, using an order of magnitude fewer primitives than dense alternatives.
LGJun 1, 2023
Doubly Robust Self-TrainingBanghua Zhu, Mingyu Ding, Philip Jacobson et al.
Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
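The two extremes described in this abstract can be illustrated with a small sketch. The abstract does not state the loss, so the form below is an assumption: the pseudo-label loss averaged over all samples, minus the pseudo-label loss on the labeled subset, plus the true-label loss on the labeled subset. The function names and the squared-error loss are hypothetical choices made for illustration only.

```python
def squared_error(pred, target):
    """Per-sample loss; a stand-in for whatever loss the task uses."""
    return (pred - target) ** 2


def mean(values):
    return sum(values) / len(values)


def doubly_robust_loss(preds_all, pseudo_all, preds_lab, pseudo_lab, true_lab,
                       loss=squared_error):
    """Assumed doubly robust form (not taken from the abstract).

    preds_all / pseudo_all: model predictions and pseudo-labels for every
        sample (labeled and unlabeled together).
    preds_lab / pseudo_lab / true_lab: predictions, pseudo-labels and true
        labels for the labeled subset only.
    """
    pseudo_term_all = mean([loss(p, y) for p, y in zip(preds_all, pseudo_all)])
    pseudo_term_lab = mean([loss(p, y) for p, y in zip(preds_lab, pseudo_lab)])
    true_term_lab = mean([loss(p, y) for p, y in zip(preds_lab, true_lab)])
    # If pseudo-labels match the true labels on the labeled subset, the last
    # two terms cancel and only the average over all samples remains.
    return pseudo_term_all - pseudo_term_lab + true_term_lab


if __name__ == "__main__":
    preds_all = [0.5, 1.0, 2.0, 3.0]   # first two samples are the labeled ones
    pseudo_all = [1.0, 1.0, 2.0, 4.0]
    print(doubly_robust_loss(preds_all, pseudo_all,
                             preds_all[:2], pseudo_all[:2], [1.0, 1.0]))
```

In this form, when labeled and unlabeled inputs come from the same distribution, the first two terms have the same expectation, so errors in the pseudo-labels cancel on average and the labeled-data term dominates.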
CVOct 3, 2023
RSRD: A Road Surface Reconstruction Dataset and Benchmark for Safe and Comfortable Autonomous DrivingTong Zhao, Chenfeng Xu, Mingyu Ding et al.
This paper addresses the growing demands for safety and comfort in intelligent robot systems, particularly autonomous vehicles, where road conditions play a pivotal role in overall driving performance. For example, reconstructing road surfaces helps to enhance the analysis and prediction of vehicle responses for motion planning and control systems. We introduce the Road Surface Reconstruction Dataset (RSRD), a real-world, high-resolution, and high-precision dataset collected with a specialized platform in diverse driving conditions. It covers common road types containing approximately 16,000 pairs of stereo images, original point clouds, and ground-truth depth/disparity maps, with accurate post-processing pipelines to ensure its quality. Based on RSRD, we further build a comprehensive benchmark for recovering road profiles through depth estimation and stereo matching. Preliminary evaluations with various state-of-the-art methods reveal the effectiveness of our dataset and the challenge of the task, underscoring substantial opportunities of RSRD as a valuable resource for advancing techniques, e.g., multi-view stereo towards safe autonomous driving. The dataset and demo videos are available at https://thu-rsxd.com/rsrd/
MAOct 11, 2023
Quantifying Agent Interaction in Multi-agent Reinforcement Learning for Cost-efficient GeneralizationYuxin Chen, Chen Tang, Ran Tian et al.
Generalization poses a significant challenge in Multi-agent Reinforcement Learning (MARL). The extent to which an agent is influenced by unseen co-players depends on the agent's policy and the specific scenario. A quantitative examination of this relationship sheds light on effectively training agents for diverse scenarios. In this study, we present the Level of Influence (LoI), a metric quantifying the interaction intensity among agents within a given scenario and environment. We observe that, generally, a more diverse set of co-play agents during training enhances the generalization performance of the ego agent; however, this improvement varies across distinct scenarios and environments. LoI proves effective in predicting these improvement disparities within specific scenarios. Furthermore, we introduce a LoI-guided resource allocation method tailored to train a set of policies for diverse scenarios under a constrained budget. Our results demonstrate that strategic resource allocation based on LoI can achieve higher performance than uniform allocation under the same computation budget.
AIFeb 24
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without ReasoningIshaan Rawal, Shubh Gupta, Yihan Hu et al.
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NORD (No Reasoning for Driving). Compared to existing VLAs, NORD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3x fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NORD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NORD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems. Website: https://nord-vla-ai.github.io/
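The difficulty bias mentioned in this abstract can be illustrated by comparing two ways of turning a group of rollout rewards into advantages. This is a minimal sketch under stated assumptions: it assumes GRPO divides mean-centered rewards by the group standard deviation, and that Dr. GRPO omits that division (Dr. GRPO is also understood to drop a response-length normalization, which is not shown). The function names and the `eps` parameter are hypothetical and not taken from NoRD.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Mean-center the group rewards and divide by the group std (assumed GRPO form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Mean-center only; no std normalization (assumed Dr. GRPO form)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    low_variance_group = [0.49, 0.51]    # rollouts that barely differ
    high_variance_group = [0.0, 1.0]     # rollouts that differ a lot
    for name, group in [("low", low_variance_group), ("high", high_variance_group)]:
        print(name, grpo_advantages(group), dr_grpo_advantages(group))
```

With std normalization, both groups end up with advantages of nearly the same magnitude, so the large reward gap in the high-variance group is scaled down relative to the tiny gap in the low-variance group. Without it, the advantage magnitudes follow the raw reward gaps.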
ROFeb 3
HetroD: A High-Fidelity Drone Dataset and Benchmark for Autonomous Driving in Heterogeneous TrafficYu-Hsiang Chen, Wei-Jer Chang, Christian Kotulla et al.
We present HetroD, a dataset and benchmark for developing autonomous driving systems in heterogeneous environments. HetroD targets the critical challenge of navigating real-world heterogeneous traffic dominated by vulnerable road users (VRUs), including pedestrians, cyclists, and motorcyclists that interact with vehicles. These mixed agent types exhibit complex behaviors such as hook turns, lane splitting, and informal right-of-way negotiation. Such behaviors pose significant challenges for autonomous vehicles but remain underrepresented in existing datasets focused on structured, lane-disciplined traffic. To bridge the gap, we collect a large-scale drone-based dataset to provide a holistic observation of traffic scenes with centimeter-accurate annotations, HD maps, and traffic signal states. We further develop a modular toolkit for extracting per-agent scenarios to support downstream task development. In total, the dataset comprises over 65.4k high-fidelity agent trajectories, 70% of which are from VRUs. HetroD supports modeling of VRU behaviors in dense, heterogeneous traffic and provides standardized benchmarks for forecasting, planning, and simulation tasks. Evaluation results reveal that state-of-the-art prediction and planning models struggle with the challenges presented by our dataset: they fail to predict lateral VRU movements, cannot handle unstructured maneuvers, and exhibit limited performance in dense and multi-agent scenarios, highlighting the need for more robust approaches to heterogeneous traffic. See our project page for more examples: https://hetroddata.github.io/HetroD/
CVSep 17, 2024
TrajSSL: Trajectory-Enhanced Semi-Supervised 3D Object DetectionPhilip Jacobson, Yichen Xie, Mingyu Ding et al.
Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training. In this work, we address the problem of improving pseudo-label quality through leveraging long-term temporal information captured in driving scenes. More specifically, we leverage pre-trained motion-forecasting models to generate object trajectories on pseudo-labeled data to further enhance the student model training. Our approach improves pseudo-label quality in two distinct manners: first, we suppress false positive pseudo-labels through establishing consistency across multiple frames of motion forecasting outputs. Second, we compensate for false negative detections by directly inserting predicted object tracks into the pseudo-labeled scene. Experiments on the nuScenes dataset demonstrate the effectiveness of our approach, improving the performance of standard semi-supervised approaches in a variety of settings.
81.3CVMay 22
LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video GenerationBo Jiang, Depu Meng, Yihan Hu et al.
Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.
CVNov 18, 2024Code
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving ScenesChensheng Peng, Chengwei Zhang, Yixiao Wang et al. · berkeley
We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline of dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting inherently can reconstruct only the static regions in dynamic environments. These extracted 2D motion priors are then mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians in the second stage. Combined with the introduced geometric regularizations, our method is able to address the over-fitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, surpassing prior self-supervised methods and achieving accuracy comparable to methods relying on external 3D bounding box annotations. Code is available at https://github.com/chengweialan/DeSiRe-GS
CVOct 24, 2024Code
PixelGaussian: Generalizable 3D Gaussian Reconstruction from Arbitrary ViewsXin Fei, Wenzhao Zheng, Yueqi Duan et al.
We propose PixelGaussian, an efficient feed-forward framework for learning generalizable 3D Gaussian reconstruction from arbitrary views. Most existing methods rely on uniform pixel-wise Gaussian representations, which learn a fixed number of 3D Gaussians for each view and cannot generalize well to more input views. In contrast, PixelGaussian dynamically adapts both the Gaussian distribution and quantity based on geometric complexity, leading to more efficient representations and significant improvements in reconstruction quality. Specifically, we introduce a Cascade Gaussian Adapter (CGA) to adjust Gaussian distribution according to local geometry complexity identified by a keypoint scorer. CGA leverages deformable attention in context-aware hypernetworks to guide Gaussian pruning and splitting, ensuring accurate representation in complex regions while reducing redundancy. Furthermore, we design a transformer-based Iterative Gaussian Refiner module that refines Gaussian representations through direct image-Gaussian interactions. Our PixelGaussian can effectively reduce Gaussian redundancy as input views increase. We conduct extensive experiments on the large-scale ACID and RealEstate10K datasets, where our method achieves state-of-the-art performance with good generalization to various numbers of views. Code: https://github.com/Barrybarry-Smith/PixelGaussian.
CVDec 9, 2024Code
Driv3R: Learning Dense 4D Reconstruction for Autonomous DrivingXin Fei, Wenzhao Zheng, Yueqi Duan et al.
Real-time 4D reconstruction for dynamic scenes remains a crucial challenge for autonomous driving perception. Most existing methods rely on depth estimation through self-supervision or multi-modality sensor fusion. In this paper, we propose Driv3R, a DUSt3R-based framework that directly regresses per-frame point maps from multi-view image sequences. To achieve streaming dense reconstruction, we maintain a memory pool to reason about both spatial relationships across sensors and dynamic temporal contexts, enhancing multi-view 3D consistency and temporal integration. Furthermore, we employ a 4D flow predictor to identify moving objects within the scene and direct the network to focus more on reconstructing these dynamic regions. Finally, we align all per-frame point maps consistently to the world coordinate system in an optimization-free manner. We conduct extensive experiments on the large-scale nuScenes dataset to evaluate the effectiveness of our method. Driv3R outperforms previous frameworks in 4D dynamic scene reconstruction, achieving 15x faster inference speed compared to methods requiring global alignment. Code: https://github.com/Barrybarry-Smith/Driv3R.
71.2CVApr 20
URoPE: Universal Relative Position Embedding across Geometric SpacesYichen Xie, Depu Meng, Chensheng Peng et al.
Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: https://urope-pe.github.io/.
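The geometric step described in this abstract, sampling 3D points along a key/value patch's camera ray at depth anchors and projecting them into the query image, can be sketched as follows. The sketch assumes ideal pinhole cameras without distortion and a known relative pose. The function name, the `(fx, fy, cx, cy)` intrinsics tuple, and the pose convention are hypothetical and not taken from the URoPE code. Applying 2D RoPE to the resulting pixel coordinates is not shown.

```python
def project_ray_samples(u, v, src_K, qry_K, R, t, depths):
    """Project points sampled along the source ray through pixel (u, v)
    into the query image.

    src_K, qry_K: pinhole intrinsics as (fx, fy, cx, cy).
    R (3x3 nested list), t (length-3 list): map source-camera coordinates
        to query-camera coordinates, x_q = R @ x_s + t.
    depths: depth anchors measured along the source camera's z axis.
    Returns one (u_q, v_q) tuple per depth, or None if the point lies
    behind the query camera.
    """
    fx, fy, cx, cy = src_K
    qfx, qfy, qcx, qcy = qry_K
    # Ray direction through the pixel, scaled so that z = 1.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    projected = []
    for d in depths:
        p = [d * c for c in ray]                       # 3D point, source frame
        q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if q[2] <= 0:
            projected.append(None)                     # behind the query camera
            continue
        projected.append((qfx * q[0] / q[2] + qcx, qfy * q[1] / q[2] + qcy))
    return projected


if __name__ == "__main__":
    K = (500.0, 500.0, 320.0, 240.0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # A 1-unit sideways shift between cameras: nearer anchors move further.
    print(project_ray_samples(100.0, 50.0, K, K, identity, [1.0, 0.0, 0.0], [2.0, 10.0]))
```

With an identity pose, every depth anchor projects back to the original pixel. With a translated query camera, the projected position depends on depth, which is how relative camera geometry enters the positional signal in this sketch.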
CVDec 12, 2024Code
DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous DrivingHao Lu, Tianshuo Xu, Wenzhao Zheng et al.
Photorealistic 4D reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. However, most existing methods perform this task offline and rely on time-consuming iterative processes, limiting their practical applications. To this end, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon), a generalizable driving scene reconstruction model, which directly predicts 4D Gaussian from surround view videos. To better integrate the surround-view images, the Prune and Dilate Block (PD-Block) is proposed to eliminate overlapping Gaussian points between adjacent views and remove redundant background points. To enhance cross-temporal information, dynamic and static decoupling is tailored to better learn geometry and motion features. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality and novel view synthesis compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle adaptation, and scene editing. Our code is available at https://github.com/EnVision-Research/DriveRecon.
CVNov 2, 2024Code
X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenariosYichen Xie, Chenfeng Xu, Chensheng Peng et al. · berkeley
Recent advancements have exploited diffusion models for the synthesis of either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling the marginal distribution of single-modality data, the mutual reliance between different modalities in describing complex driving scenes remains underexplored. To fill this gap, we propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design the cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence. In addition, X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding boxes, images, and point clouds. Extensive experiments demonstrate the high fidelity of X-DRIVE's synthetic point clouds and multi-view images, which adhere to input conditions while ensuring reliable cross-modality consistency. Our code will be made publicly available at https://github.com/yichen928/X-Drive.
CVFeb 14, 2024Code
Depth-aware Volume Attention for Texture-less Stereo MatchingTong Zhao, Mingyu Ding, Wei Zhan et al.
Stereo matching plays a crucial role in 3D perception and scenario understanding. Despite the proliferation of promising methods, addressing texture-less and texture-repetitive conditions remains challenging due to the insufficient availability of rich geometric and semantic information. In this paper, we propose a lightweight volume refinement scheme to tackle the texture deterioration in practical outdoor scenarios. Specifically, we introduce a depth volume supervised by the ground-truth depth map, capturing the relative hierarchy of image texture. Subsequently, the disparity discrepancy volume undergoes hierarchical filtering through the incorporation of depth-aware hierarchy attention and target-aware disparity attention modules. Local fine structure and context are emphasized to mitigate ambiguity and redundancy during volume aggregation. Furthermore, we propose a more rigorous evaluation metric that considers depth-wise relative error, providing comprehensive evaluations for universal stereo matching and depth estimation models. We extensively validate the superiority of our proposed methods on public datasets. Results demonstrate that our model achieves state-of-the-art performance, particularly excelling in scenarios with texture-less images. The code is available at https://github.com/ztsrxh/DVANet.
LGFeb 22, 2024Code
BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing GameplayCatherine Weaver, Chen Tang, Ce Hao et al.
Imitation learning learns a policy from demonstrations without requiring hand-designed reward functions. In many robotic tasks, such as autonomous racing, imitated policies must model complex environment dynamics and human decision-making. Sequence modeling is highly effective in capturing intricate patterns of motion sequences but struggles to adapt to new environments or distribution shifts that are common in real-world robotics tasks. In contrast, Adversarial Imitation Learning (AIL) can mitigate this effect, but struggles with sample inefficiency and handling complex motion patterns. Thus, we propose BeTAIL: Behavior Transformer Adversarial Imitation Learning, which combines a Behavior Transformer (BeT) policy from human demonstrations with online AIL. BeTAIL adds an AIL residual policy to the BeT policy to model the sequential decision-making process of human experts and correct for out-of-distribution states or shifts in environment dynamics. We test BeTAIL on three challenges with expert-level demonstrations of real human gameplay in Gran Turismo Sport. Our proposed residual BeTAIL reduces environment interactions and improves racing performance and stability, even when the BeT is pretrained on different tracks than downstream learning. Videos and code available at: https://sites.google.com/berkeley.edu/BeTAIL/home.
68.4CVApr 3
SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving ScenesQuentin Herau, Tianshuo Xu, Depu Meng et al.
Feed-forward 3D Gaussian Splatting methods have achieved impressive reconstruction quality for autonomous driving scenes, yet they entangle scene geometry with transient appearance properties such as lighting, weather, and time of day. This coupling prevents relighting, appearance transfer, and consistent rendering across multi-traversal data captured under varying environmental conditions. We present SpectralSplat, a method that disentangles appearance from geometry within a feed-forward Gaussian Splatting framework. Our key insight is to factor color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. To enforce disentanglement, we train with paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion-based generative refinement, and supervise with complementary consistency, reconstruction, cross-appearance, and base color losses. We further introduce an appearance-adaptable temporal history that stores appearance-agnostic features, enabling accumulated Gaussians to be re-rendered under arbitrary target appearances. Experiments demonstrate that SpectralSplat preserves the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting across driving sequences.
CVDec 15, 2021Code
Towards General and Efficient Active LearningYichen Xie, Masayoshi Tomizuka, Wei Zhan
Active learning selects the most informative samples to exploit limited annotation budgets. Existing work follows a cumbersome pipeline that repeats time-consuming model training and batch data selection multiple times. In this paper, we challenge this status quo by proposing a novel general and efficient active learning (GEAL) method that follows a newly designed pipeline. Utilizing a publicly available pretrained model, our method selects data from different datasets with a single-pass inference of the same model, without extra training or supervision. To capture subtle local information, we propose knowledge clusters extracted from intermediate features. Free from the troublesome batch selection strategy, all data samples are selected in one shot through distance-based sampling at the fine-grained knowledge-cluster level. The whole process is hundreds of times faster than prior art. Extensive experiments verify the effectiveness of our method on object detection, image classification, and semantic segmentation. Our code is publicly available at https://github.com/yichen928/GEAL_active_learning.
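As a rough illustration of one-shot, distance-based selection over feature vectors, the sketch below uses greedy farthest-point selection. This is an assumption for the example only: the abstract does not specify GEAL's exact sampling rule, and the function name, the Euclidean metric, and the toy features are hypothetical.

```python
import math
from typing import List, Sequence


def farthest_point_selection(
    features: Sequence[Sequence[float]], budget: int, start: int = 0
) -> List[int]:
    """Greedily pick `budget` indices so that each new pick is the point
    farthest (in Euclidean distance) from everything selected so far."""
    if not 0 < budget <= len(features):
        raise ValueError("budget must be between 1 and the number of features")
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(f, features[start]) for f in features]
    while len(selected) < budget:
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected
```

Because the selection depends only on features from a single inference pass, no model retraining is needed between picks, which is the property the abstract emphasizes.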
CVJun 8, 2021Code
Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained ModelsChenfeng Xu, Shijia Yang, Tomer Galanti et al.
3D point-clouds and 2D images are different visual representations of the physical world. While human vision can understand both representations, computer vision models designed for 2D image and 3D point-cloud understanding are quite different. Our paper explores the potential of transferring 2D model architectures and weights to understand 3D point-clouds, by empirically investigating the feasibility of the transfer, the benefits of the transfer, and shedding light on why the transfer works. We discover that we can indeed use the same architecture and pretrained weights of a neural net model to understand both images and point-clouds. Specifically, we transfer the image-pretrained model to a point-cloud model by copying or inflating the weights. We find that finetuning the transformed image-pretrained models (FIP) with minimal effort (only on input, output, and normalization layers) can achieve competitive performance on 3D point-cloud classification, beating a wide range of point-cloud models that adopt task-specific architectures and use a variety of tricks. When finetuning the whole model, the performance improves even further. Meanwhile, FIP improves data efficiency, with gains reaching up to 10.0 top-1 accuracy points on few-shot classification. It also speeds up the training of point-cloud models by up to 11.1x for a target accuracy (e.g., 90% accuracy). Lastly, we provide an explanation of the image to point-cloud transfer from the aspect of neural collapse. The code is available at: https://github.com/chenfengxu714/image2point.
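The idea of inflating 2D weights into 3D can be sketched on a single kernel. This is an illustrative assumption rather than the paper's code: the function name is hypothetical, and dividing by the depth (so that an input constant along the new axis produces the same response as the 2D kernel) is a common inflation convention that the abstract does not confirm.

```python
from typing import List, Sequence


def inflate_kernel_2d_to_3d(
    kernel_2d: Sequence[Sequence[float]], depth: int, rescale: bool = True
) -> List[List[List[float]]]:
    """Turn a (kH x kW) kernel into a (depth x kH x kW) kernel by repeating
    it along a new depth axis. With rescale=True each copy is divided by
    `depth`; with rescale=False the weights are copied unchanged."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    scale = 1.0 / depth if rescale else 1.0
    return [[[w * scale for w in row] for row in kernel_2d] for _ in range(depth)]
```

With rescaling enabled, summing the inflated kernel over its depth axis recovers the original 2D kernel, so the pretrained filter's response is preserved on depth-constant inputs.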