CV · Mar 23, 2023 · Code
MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training
Runsen Xu, Tai Wang, Wenwei Zhang et al. · cmu
This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Code and the benchmark will be available at https://github.com/SmartBot-PJLab/MV-JAR.
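As a rough illustration of the sampling idea named in the abstract above, the sketch below runs standard farthest point sampling over voxel centres. The abstract does not say how the "reversed" variant works; the reading assumed here (sampled voxels are kept and the remainder are masked) is only one plausible interpretation, and the function names `farthest_point_sampling` and `split_kept_and_masked` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen by standard farthest point sampling."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # Smallest squared distance from every point to the chosen set so far.
    min_d2 = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick the point that is farthest from everything already chosen.
        nxt = max(range(len(points)), key=lambda i: min_d2[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_d2[i]:
                min_d2[i] = d
    return chosen


def split_kept_and_masked(voxel_centres, num_kept):
    """ASSUMPTION: keep the spatially spread-out sample, mask the rest."""
    kept = farthest_point_sampling(voxel_centres, num_kept)
    kept_set = set(kept)
    masked = [i for i in range(len(voxel_centres)) if i not in kept_set]
    return kept, masked


# Three voxels clustered near the sensor and one far away.
centres = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (10.0, 0.0, 0.0)]
kept, masked = split_kept_and_masked(centres, 3)
```

Under this assumed reading, the kept set covers sparse distant regions first, so masking falls mostly on densely populated areas near the sensor.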
CV · Jun 8, 2022
CO^3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving
Runjian Chen, Yao Mu, Runsen Xu et al.
Unsupervised contrastive learning for indoor-scene point clouds has achieved great success. However, unsupervised learning on point clouds in outdoor scenes remains challenging because previous methods need to reconstruct the whole scene and capture partial views for the contrastive objective. This is infeasible in outdoor scenes with moving objects, obstacles, and sensors. In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representations for outdoor-scene point clouds in an unsupervised manner. CO^3 has several merits compared to existing methods. (1) It utilizes LiDAR point clouds from the vehicle side and the infrastructure side to build views that differ enough yet maintain common semantic information for contrastive learning, which are more appropriate than views built by previous methods. (2) Alongside the contrastive objective, shape context prediction is proposed as a pre-training goal and brings more task-relevant information for unsupervised 3D point cloud representation learning, which is beneficial when transferring the learned representation to downstream detection tasks. (3) Compared to previous methods, the representation learned by CO^3 can be transferred to different outdoor scene datasets collected by different types of LiDAR sensors. (4) CO^3 improves current state-of-the-art methods on both the Once and KITTI datasets by up to 2.58 mAP. We believe CO^3 will facilitate understanding of LiDAR point clouds in outdoor scenes.
CV · Nov 20, 2023 · Code
CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement
Boni Hu, Lin Chen, Runjian Chen et al.
Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images, taken at some unknown location, to a set of geo-tagged reference images. Existing methods, devoted to semantic feature representation, are evolving towards robustness to a wide variety of differences between query and reference images, including illumination and viewpoint changes, as well as scale and seasonal variations. However, practical visual geolocalization approaches need to be robust under appearance changes and extreme viewpoint variation, while providing accurate global location estimates. Inspired by curriculum design, in which humans learn general knowledge first and then delve into professional expertise, we first recognize the semantic scene and then measure geometric structure. Our approach, termed CurriculumLoc, involves a carefully designed multi-stage refinement pipeline and a novel keypoint detection and description method with global semantic awareness and local geometric verification. We rerank candidates and solve a particular cross-domain perspective-n-point (PnP) problem based on these keypoints and their corresponding descriptors, so that position refinement occurs incrementally. Extensive experimental results on our collected dataset, TerraTrack, and a benchmark dataset, ALTO, demonstrate that our approach achieves the aforementioned desirable characteristics of a practical visual geolocalization solution. Additionally, we achieve new high recall@1 scores of 62.6% and 94.5% on ALTO, with two different distance metrics, respectively. Dataset, code and trained models are publicly available at https://github.com/npupilab/CurriculumLoc.
CV · Jun 17, 2022
CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer
Yao Mu, Shoufa Chen, Mingyu Ding et al.
Transformer has achieved great success in learning vision and language representations, which are general across various downstream tasks. In visual control, learning a transferable state representation that can transfer between different control tasks is important to reduce the training sample size. However, porting Transformer to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer), possessing many appealing benefits that prior art does not have. First, CtrlFormer jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, where multitask representation can be learned and transferred without catastrophic forgetting. Second, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve high sample efficiency, which is important in control problems. For example, in the DMControl benchmark, unlike recent advanced methods that failed by producing a zero score in the "Cartpole" task after transfer learning with 100k samples, CtrlFormer can achieve a state-of-the-art score with only 100k samples while maintaining the performance of previous tasks. The code and models are released on our project homepage.
CV · Sep 19, 2023
SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations
Xiangchao Yan, Runjian Chen, Bo Zhang et al.
Annotating 3D LiDAR point clouds for perception tasks is fundamental for many applications, e.g., autonomous driving, yet it remains notoriously labor-intensive. The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets as well as tasks. In this paper, we propose SPOT, namely Scalable Pre-training via Occupancy prediction for learning Transferable 3D representations, under such a label-efficient fine-tuning paradigm. SPOT achieves effectiveness on various public datasets with different downstream tasks, showcasing its general representation power, cross-domain robustness and data scalability, which are three key factors for real-world application. Specifically, we both theoretically and empirically show, for the first time, that general representation learning can be achieved through the task of occupancy prediction. Then, to address the domain gap caused by different LiDAR sensors and annotation methods, we develop a beam re-sampling technique for point cloud augmentation combined with a class-balancing strategy. Furthermore, scalable pre-training is observed, that is, the downstream performance across all the experiments gets better with more pre-training data. Additionally, such a pre-training strategy also remains compatible with unlabeled data. The hope is that our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
CV · Apr 24, 2024 · Code
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying, Fanqing Meng, Jin Wang et al.
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
CV · Jun 10, 2025 · Code
Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models
Xuanchi Ren, Yifan Lu, Tianshi Cao et al. · nvidia, utoronto
Collecting and annotating real-world data for safety-critical physical AI systems, such as Autonomous Vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain that are capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps in mitigating long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection and driving policy learning. We open source our pipeline toolkit, dataset and model weights through NVIDIA's Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams
CV · Jul 21, 2021 · Code
CycleMLP: A MLP-like Architecture for Dense Prediction
Shoufa Chen, Enze Xie, Chongjian Ge et al.
This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. Compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation, CycleMLP has two advantages. (1) It can cope with various image sizes. (2) It achieves linear computational complexity with respect to image size by using local windows. In contrast, previous MLPs have $O(N^2)$ computations due to fully spatial connections. We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs. We expand the MLP-like models' applicability, making them a versatile backbone for dense prediction tasks. CycleMLP achieves competitive results on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP-Tiny outperforms Swin-Tiny by 1.3% mIoU on the ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on the ImageNet-C dataset. Code is available at https://github.com/ShoufaChen/CycleMLP.
RO · Feb 25, 2024
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
Yao Mu, Junting Chen, Qinglong Zhang et al.
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one navigation task.
CV · Dec 4, 2024
TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception
Runjian Chen, Hyoungseob Park, Bo Zhang et al.
Labeling LiDAR point clouds is notoriously time-and-energy-consuming, which spurs recent unsupervised 3D representation learning methods to alleviate the labeling burden in LiDAR perception via pretrained weights. Almost all existing work focuses on a single frame of LiDAR point cloud and neglects the temporal LiDAR sequence, which naturally accounts for object motion (and their semantics). Instead, we propose TREND, namely Temporal REndering with Neural fielD, to learn 3D representations by forecasting future observations in an unsupervised manner. Unlike existing work that follows conventional contrastive learning or masked autoencoding paradigms, TREND integrates forecasting for 3D pre-training through a Recurrent Embedding scheme to generate 3D embeddings across time and a Temporal Neural Field to represent the 3D scene, through which we compute the loss using differentiable rendering. To the best of our knowledge, TREND is the first work on temporal forecasting for unsupervised 3D representation learning. We evaluate TREND on downstream 3D object detection tasks on popular datasets, including NuScenes, Once and Waymo. Experimental results show that TREND brings up to 90% more improvement compared to previous SOTA unsupervised 3D pre-training methods and generally improves different downstream models across datasets, demonstrating that temporal forecasting indeed brings improvement for LiDAR perception. Code and models will be released.
CY · Mar 4, 2024
Position: Towards Implicit Prompt For Text-To-Image Models
Yue Yang, Yuqi Lin, Hong Liu et al.
Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (prompts that hint at a target without explicitly mentioning it). These prompts may evade safety constraints and pose potential threats to the applications of these models. This position paper highlights the current state of T2I models toward implicit prompts. We present a benchmark named ImplicitBench and conduct an investigation on the performance and impacts of implicit prompts with popular T2I models. Specifically, we design and collect more than 2,000 implicit prompts covering three aspects: General Symbols, Celebrity Privacy, and Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models' capabilities under these implicit prompts. Experimental results show that (1) T2I models are able to accurately create various target symbols indicated by implicit prompts; (2) implicit prompts bring potential risks of privacy leakage for T2I models; and (3) NSFW constraints in most of the evaluated T2I models can be bypassed with implicit prompts. We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.
CV · Mar 11, 2025
JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
Runjian Chen, Wenqi Shao, Bo Zhang et al.
Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. Real-world 3D data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. In contrast, in simulators like CARLA, generating labeled LiDAR point clouds with corner cases is straightforward. However, introducing synthetic point clouds to improve real perception is non-trivial. This stems from two challenges: 1) the sample efficiency of simulation datasets and 2) simulation-to-real gaps. To overcome both challenges, we propose a plug-and-play method called JiSAM, shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. In extensive experiments conducted on the well-known AD dataset NuScenes, we demonstrate that, with a SOTA 3D object detector, JiSAM is able to utilize the simulation data and labels on only 2.5% of the available real data to achieve performance comparable to models trained on all real data. Additionally, JiSAM achieves more than 15 mAP on the objects not labeled in the real training set. We will release models and code.
CV · Mar 10, 2025
Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
Ziliang Miao, Runjian Chen, Yixi Cai et al.
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows a strong bias toward objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called mIoU_obj to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
CV · Dec 4, 2024
CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning
Runjian Chen, Hang Zhang, Avinash Ravichandran et al.
Unsupervised 3D representation learning reduces the burden of labeling multimodal 3D data for fusion perception tasks. Among different pre-training paradigms, differentiable-rendering-based methods have shown the most promise. However, existing works conduct pre-training separately for each modality due to the computational cost of processing large point clouds together with images. As such, the mutual benefit of high-level semantics (from images) and 3D structure (from point clouds) has not been exploited. To address this gap, we propose a joint unsupervised differentiable-rendering-based pre-training method for images and point clouds, termed CLAP, short for Curvature sampLing and leArnable Prototype. Specifically, our method overcomes the computational hurdle by using Curvature Sampling to select the more informative points/pixels for pre-training. To uncover the performance benefits brought by their complementarity, we propose to use learnable prototypes to represent parts of the 3D scenes in a common feature space and an Expectation-Maximization training scheme to associate embeddings of each modality with prototypes. We further propose a swapping prediction loss that explores their interplay through prototypes, along with a Gram Matrix Regularization term to maintain training stability. Experiments on the NuScenes and Waymo datasets show that CLAP achieves up to 100% more performance gain compared to previous SOTA pre-training methods. Code and models will be released.
CV · Jan 17, 2022
RestoreFormer: High-Quality Blind Face Restoration from Undegraded Key-Value Pairs
Zhouxia Wang, Jiawei Zhang, Runjian Chen et al.
Blind face restoration aims to recover a high-quality face image from unknown degradations. As face images contain abundant contextual information, we propose a method, RestoreFormer, which explores fully-spatial attention to model contextual information and surpasses existing works that use local operators. RestoreFormer has several benefits compared to prior art. First, unlike the conventional multi-head self-attention in previous Vision Transformers (ViTs), RestoreFormer incorporates a multi-head cross-attention layer to learn fully-spatial interactions between corrupted queries and high-quality key-value pairs. Second, the key-value pairs in RestoreFormer are sampled from a reconstruction-oriented high-quality dictionary, whose elements are rich in high-quality facial features specifically aimed at face reconstruction, leading to superior restoration results. Third, RestoreFormer outperforms advanced state-of-the-art methods on one synthetic dataset and three real-world datasets, and produces images with better visual quality.
RO · Sep 15, 2020
RaLL: End-to-end Radar Localization on Lidar Map Using Differentiable Measurement Model
Huan Yin, Runjian Chen, Yue Wang et al.
Compared to the onboard camera and laser scanner, a radar sensor provides lighting- and weather-invariant sensing, which is naturally suitable for long-term localization under adverse conditions. However, radar data is sparse and noisy, resulting in challenges for radar mapping. On the other hand, the most popular available maps are currently built by lidar. In this paper, we propose an end-to-end deep learning framework for Radar Localization on Lidar Map (RaLL) to bridge the gap, which not only achieves robust radar localization but also exploits the mature lidar mapping technique, thus reducing the cost of radar mapping. We first embed both sensor modalities into a common feature space by a neural network. Then multiple offsets are added to the map modality for exhaustive similarity evaluation against the current radar modality, yielding the regression of the current pose. Finally, we apply this differentiable measurement model to a Kalman Filter (KF) to learn the whole sequential localization process in an end-to-end manner. The whole learning system is differentiable, with the network-based measurement model at the front-end and the KF at the back-end. To validate the feasibility and effectiveness, we employ multi-session multi-scene datasets collected from the real world, and the results demonstrate that our proposed system achieves superior performance over 90 km of driving, even in generalization scenarios where the model is trained in the UK and tested in South Korea. We also release the source code publicly.
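The offset-and-compare step described in the abstract above can be illustrated with a generic soft-argmax over candidate offsets. This is a minimal 1-D sketch under stated assumptions, not RaLL's network: the features are plain lists, similarity is negative squared error, and the names `shift`, `soft_offset`, and `temperature` are hypothetical.

```python
import math


def shift(signal, offset):
    """Circularly shift a 1-D list by `offset` cells (positive moves right)."""
    n = len(signal)
    return [signal[(i - offset) % n] for i in range(n)]


def soft_offset(map_feat, radar_feat, offsets, temperature=1.0):
    """Score every candidate offset, then return the softmax-weighted offset."""
    scores = []
    for o in offsets:
        shifted = shift(map_feat, o)
        sse = sum((a - b) ** 2 for a, b in zip(shifted, radar_feat))
        scores.append(-sse / temperature)  # higher score = more similar
    # Numerically stable softmax over the similarity scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The expectation over offsets is smooth in the scores, which is what
    # makes this kind of estimate usable inside an end-to-end trained system.
    estimate = sum(p * o for p, o in zip(probs, offsets))
    return estimate, probs


map_feat = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
radar_feat = shift(map_feat, 2)  # the "true" displacement is 2 cells
estimate, probs = soft_offset(map_feat, radar_feat, list(range(-3, 4)))
```

Because the estimate is a weighted average instead of a hard argmax, gradients can flow back through the similarity scores to whatever produced the features.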
RO · Sep 1, 2020
Deep Samplable Observation Model for Global Localization and Kidnapping
Runjian Chen, Huan Yin, Yanmei Jiao et al.
Global localization and kidnapping are two challenging problems in robot localization. The popular method, Monte Carlo Localization (MCL), addresses the problem by iteratively updating a set of particles with a "sampling-weighting" loop. Sampling is decisive to the performance of MCL [1]. However, traditional MCL can only sample from a uniform distribution over the state space. Although variants of MCL propose different sampling models, they fail to provide an accurate distribution or generalize across scenes. To better deal with these problems, we present a distribution proposal model, named Deep Samplable Observation Model (DSOM). DSOM takes a map and a 2D laser scan as inputs and outputs a conditional multimodal probability distribution of the pose, focusing the samples on the regions with higher likelihood. With such samples, the convergence is expected to be more effective and efficient. Considering that the learning-based sampling model may sometimes fail to capture the true pose, we furthermore propose the Adaptive Mixture MCL (AdaM MCL), which deploys a trusty mechanism to adaptively select the updating mode for each particle to tolerate this situation. Equipped with DSOM, AdaM MCL can achieve more accurate estimation, faster convergence and better scalability compared to previous methods in both synthetic and real scenes. Even in real environments with long-term changes, AdaM MCL is able to localize the robot using DSOM trained only on simulated observations from a SLAM map or a blueprint map.
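The "sampling-weighting" loop mentioned in the abstract above can be sketched as a generic particle filter step with a mixture proposal. This is a toy 1-D illustration, not DSOM or AdaM MCL: the state is a scalar position, the sensor reads the position directly with Gaussian noise, and `mcl_step`, `proposal_sampler`, and `mix_ratio` are hypothetical names. Passing a uniform sampler mimics the traditional global proposal; a learned model would replace it.

```python
import math
import random


def gaussian_likelihood(observed, expected, sigma):
    """Unnormalised Gaussian measurement likelihood."""
    return math.exp(-0.5 * ((observed - expected) / sigma) ** 2)


def mcl_step(particles, observation, proposal_sampler, mix_ratio, rng,
             sigma=0.5, motion_noise=0.1):
    """One sampling-weighting-resampling iteration on 1-D particles."""
    n = len(particles)
    # Sampling: most particles follow a noisy motion model, while a fraction
    # is redrawn from the proposal so the filter can recover after kidnapping.
    moved = []
    for p in particles:
        if rng.random() < mix_ratio:
            moved.append(proposal_sampler(rng))
        else:
            moved.append(p + rng.gauss(0.0, motion_noise))
    # Weighting: score each particle against the current observation.
    weights = [gaussian_likelihood(observation, p, sigma) for p in moved]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / n] * n  # degenerate case: fall back to uniform
    else:
        weights = [w / total for w in weights]
    # Resampling: draw a new particle set in proportion to the weights.
    return rng.choices(moved, weights=weights, k=n)


def run(proposal_sampler, steps=10, seed=0):
    """Localize a stationary robot at position 7.0 on a 10-unit corridor."""
    rng = random.Random(seed)
    true_pose = 7.0
    particles = [rng.uniform(0.0, 10.0) for _ in range(200)]
    for _ in range(steps):
        observation = true_pose + rng.gauss(0.0, 0.1)
        particles = mcl_step(particles, observation, proposal_sampler, 0.1, rng)
    return sum(particles) / len(particles)


mean_pose = run(lambda rng: rng.uniform(0.0, 10.0))
```

With a uniform proposal, most redrawn particles land in low-likelihood regions and are discarded at resampling, which is the inefficiency a likelihood-focused proposal is meant to reduce.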
LG · May 16, 2020
Graph Partitioning and Graph Neural Network based Hierarchical Graph Matching for Graph Similarity Computation
Haoyan Xu, Ziheng Duan, Jie Feng et al.
Graph similarity computation aims to predict a similarity score between a pair of graphs to facilitate downstream applications, such as finding the chemical compounds most similar to a query compound or few-shot 3D action recognition. Recently, some graph similarity computation models based on neural networks have been proposed, which are either based on graph-level interaction or node-level comparison. However, as the number of nodes in the graph increases, these models inevitably suffer from reduced representation ability or high computation cost. Motivated by this observation, we propose a graph partitioning and graph neural network-based model, called PSimGNN, to effectively resolve this issue. Specifically, each of the input graphs is partitioned into a set of subgraphs to extract the local structural features directly. Next, a novel graph neural network with an attention mechanism is designed to map each subgraph into an embedding vector. Some of these subgraph pairs are automatically selected for node-level comparison to supplement the subgraph-level embedding with fine-grained information. Finally, coarse-grained interaction information among subgraphs and fine-grained comparison information among nodes in different subgraphs are integrated to predict the final similarity score. Experimental results on graph datasets with different graph sizes demonstrate that PSimGNN outperforms state-of-the-art methods in graph similarity computation tasks using approximate Graph Edit Distance (GED) as the graph similarity metric.
LG · May 14, 2020
CoSimGNN: Towards Large-scale Graph Similarity Computation
Haoyan Xu, Runjian Chen, Yueyang Wang et al.
The ability to compute similarity scores between graphs based on metrics such as Graph Edit Distance (GED) is important in many real-world applications. Computing exact GED values is typically an NP-hard problem, and traditional algorithms usually achieve an unsatisfactory trade-off between accuracy and efficiency. Recently, Graph Neural Networks (GNNs) have provided a data-driven solution for this task, which is more efficient while maintaining prediction accuracy in small graph (around 10 nodes per graph) similarity computation. Existing GNN-based methods, which either embed the two graphs separately (lacking low-level cross-graph interactions) or deploy cross-graph interactions for whole graph pairs (redundant and time-consuming), are still not able to achieve competitive results when the number of nodes in graphs increases. In this paper, we focus on similarity computation for large-scale graphs and propose the "embedding-coarsening-matching" framework CoSimGNN, which first embeds and coarsens large graphs with an adaptive pooling operation and then deploys fine-grained interactions on the coarsened graphs for final similarity scores. Furthermore, we create several synthetic datasets which provide new benchmarks for graph similarity computation. Detailed experiments on both synthetic and real-world datasets have been conducted, and CoSimGNN achieves the best performance while the inference time is at most 1/3 of that of the previous state-of-the-art.