CVDec 15, 2022
HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous DrivingAndrei Zanfir, Mihai Zanfir, Alexander Gorban et al.
Autonomous driving is an exciting new industry, posing important research questions. Within the perception module, 3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians. While hardware systems and sensors have dramatically improved over the decades -- with cars potentially boasting complex LiDAR and vision systems and with a growing expansion of the available body of dedicated datasets for this newly available information -- not much work has been done to harness these novel signals for the core problem of 3D human pose estimation. Our method, which we coin HUM3DIL (HUMan 3D from Images and LiDAR), efficiently makes use of these complementary signals, in a semi-supervised fashion and outperforms existing methods with a large margin. It is a fast and compact model for onboard deployment. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages. Quantitative experiments on the Waymo Open Dataset support these claims, where we achieve state-of-the-art results on the task of 3D pose estimation.
CVOct 14, 2022
Motion Inspired Unsupervised Perception and Prediction in Autonomous DrivingMahyar Najibi, Jingwei Ji, Yin Zhou et al.
Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive to the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.
CVSep 25, 2023
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous DrivingMahyar Najibi, Jingwei Ji, Yin Zhou et al.
Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.
CVJun 7, 2023
3D Human Keypoints Estimation From Point Clouds in the Wild Without Human LabelsZhenzhen Weng, Alexander S. Gorban, Jingwei Ji et al.
Training a 3D human keypoint detector from point clouds in a supervised manner requires large volumes of high quality labels. While it is relatively easy to capture large amounts of human point clouds, annotating 3D keypoints is expensive, subjective, error prone and especially difficult for long-tail cases (pedestrians with rare poses, scooterists, etc.). In this work, we propose GC-KPL - Geometry Consistency inspired Key Point Leaning, an approach for learning 3D human joint locations from point clouds without human labels. We achieve this by our novel unsupervised loss formulations that account for the structure and movement of the human body. We show that by training on a large training set from Waymo Open Dataset without any human annotated keypoints, we are able to achieve reasonable performance as compared to the fully supervised approach. Further, the backbone benefits from the unsupervised training and is useful in downstream fewshot learning of keypoints, where fine-tuning on only 10 percent of the labeled training data gives comparable performance to fine-tuning on the entire set. We demonstrated that GC-KPL outperforms by a large margin over SoTA when trained on entire dataset and efficiently leverages large volumes of unlabeled data.
80.7CEApr 2
The Pandora's Box Problem with Sequential InspectionsAli Aouad, Jingwei Ji, Yaron Shaposhnik
The Pandora's box problem (Weitzman 1979) is a core model in economic theory that captures an agent's (Pandora's) search for the best alternative (box). We study an important generalization of the problem where the agent can either fully open boxes for a certain fee to reveal their exact values or partially open them at a reduced cost. This introduces a new tradeoff between information acquisition and cost efficiency. We establish a hardness result and employ an array of techniques in stochastic optimization to provide a comprehensive analysis of this model. This includes (1) the identification of structural properties of the optimal policy that provide insights about optimal decisions; (2) the derivation of problem relaxations and provably near-optimal solutions; (3) the characterization of the optimal policy in special yet non-trivial cases; and (4) an extensive numerical study that compares the performance of various policies, and which provides additional insights about the optimal policy. Throughout, we show that intuitive threshold-based policies that extend the Pandora's box optimal solution can effectively guide search decisions.
LGAug 4, 2022
Risk-Aware Linear Bandits: Theory and Applications in Smart Order RoutingJingwei Ji, Renyuan Xu, Ruihao Zhu
Motivated by practical considerations in machine learning for financial decision-making, such as risk aversion and large action space, we consider risk-aware bandits optimization with applications in smart order routing (SOR). Specifically, based on preliminary observations of linear price impacts made from the NASDAQ ITCH dataset, we initiate the study of risk-aware linear bandits. In this setting, we aim at minimizing regret, which measures our performance deficit compared to the optimum's, under the mean-variance metric when facing a set of actions whose rewards are linear functions of (initially) unknown parameters. Driven by the variance-minimizing globally-optimal (G-optimal) design, we propose the novel instance-independent Risk-Aware Explore-then-Commit (RISE) algorithm and the instance-dependent Risk-Aware Successive Elimination (RISE++) algorithm. Then, we rigorously analyze their near-optimal regret upper bounds to show that, by leveraging the linear structure, our algorithms can dramatically reduce the regret when compared to existing methods. Finally, we demonstrate the performance of the algorithms by conducting extensive numerical experiments in the SOR setup using both synthetic datasets and the NASDAQ ITCH dataset. Our results reveal that 1) The linear structure assumption can indeed be well supported by the Nasdaq dataset; and more importantly 2) Both RISE and RISE++ can significantly outperform the competing methods, in terms of regret, especially in complex decision-making scenarios.
50.2CCMay 18
Risk of Bad Tails: CVaR-Aware Pandora's Box and Prophet InequalityJingwei Ji
We study Conditional Value-at-Risk (CVaR) variants of two canonical sequential decision problems: Pandora's box and the prophet inequality. For Pandora's box, the risk-aware problem retains an exact Weitzman-style index solution after a one-dimensional variational reduction. For the prophet inequality, the picture is different: for every CVaR level $α\in(0,1)$, no positive constant approximation guarantee can hold without distributional structure, in sharp contrast with the risk-neutral case $α=1$, and we characterize the tight instance-dependent guarantee. Already in two-item hard instances, the prophet's CVaR benchmark can be made arbitrarily large while every online policy's CVaR remains bounded. This impossibility is due to the nature of CVaR objective: it measures only the worst $α$-fraction of outcomes, so any compromise an online policy makes to preserve the chance of a large payoff in the upper $(1-α)$-fraction does not help its CVaR. It turns out that additional distributional structure restores a uniform result: under continuous reward distributions satisfying a recentered increasing-failure-rate-average (IFRA) condition, a threshold policy achieves an explicit constant bound.
CVOct 30, 2024
EMMA: End-to-End Multimodal Model for Autonomous DrivingJyh-Jing Hwang, Runsheng Xu, Hubert Lin et al.
We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built upon a multi-modal large language model foundation like Gemini, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. We hope that our results will inspire research to further evolve the state of the art in autonomous driving model architectures.
CVApr 30, 2024
MoST: Multi-modality Scene Tokenization for Motion PredictionNorman Mu, Jingwei Ji, Zhenpei Yang et al.
Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art.
CVJan 4, 2024
3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language DistillationZihao Xiao, Longlong Jing, Shangxuan Wu et al.
3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing these approaches to unseen things and unseen stuff categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation has achieved promising results that solely rely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not guarantee good performance due to poor per-mask classification quality, especially for novel stuff categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features, using a single classification head to make predictions for both base and novel classes. To further improve the classification performance on novel classes and leverage the CLIP model, we propose two novel loss functions: object-level distillation loss and voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms the strong baseline by a large margin.
CVMay 30, 2025
S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Modelwith Spatio-Temporal Visual RepresentationYichen Xie, Runsheng Xu, Tong He et al.
The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches--which directly learn from sensor inputs to generate planning trajectories without human annotations often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and Waymo Open Motion Dataset (with in-house camera data). Results show that S4-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of unannotated driving logs.
PROct 18, 2024
Multi-Task Dynamic Pricing in Credit Market with Contextual InformationAdel Javanmard, Jingwei Ji, Renyuan Xu
We study the dynamic pricing problem faced by a broker seeking to learn prices for a large number of credit market securities, such as corporate bonds, government bonds, loans, and other credit-related securities. A major challenge in pricing these securities stems from their infrequent trading and the lack of transparency in over-the-counter (OTC) markets, which leads to insufficient data for individual pricing. Nevertheless, many securities share structural similarities that can be exploited. Moreover, brokers often place small "probing" orders to infer competitors' pricing behavior. Leveraging these insights, we propose a multi-task dynamic pricing framework that leverages the shared structure across securities to enhance pricing accuracy. In the OTC market, a broker wins a quote by offering a more competitive price than rivals. The broker's goal is to learn winning prices while minimizing expected regret against a clairvoyant benchmark. We model each security using a $d$-dimensional feature vector and assume a linear contextual model for the competitor's pricing of the yield, with parameters unknown a priori. We propose the Two-Stage Multi-Task (TSMT) algorithm: first, an unregularized MLE over pooled data to obtain a coarse parameter estimate; second, a regularized MLE on individual securities to refine the parameters. We show that the TSMT achieves a regret bounded by $\tilde{O} ( δ_{\max} \sqrt{T M d} + M d ) $, outperforming both fully individual and fully pooled baselines, where $M$ is the number of securities and $δ_{\max}$ quantifies their heterogeneity.
CVOct 20, 2025
Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language ModelsKatie Luo, Jingwei Ji, Tong He et al.
Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning -- making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.
CVMay 11, 2021
Home Action Genome: Cooperative Compositional Action UnderstandingNishant Rai, Haofeng Chen, Jingwei Ji et al.
Existing research on action recognition treats activities as monolithic events occurring in videos. Recently, the benefits of formulating actions as a combination of atomic-actions have shown promise in improving action understanding with the emergence of datasets containing such annotations, allowing us to learn representations capturing this information. However, there remains a lack of studies that extend action composition and leverage multiple viewpoints and multiple modalities of data for representation learning. To promote research in this direction, we introduce Home Action Genome (HOMAGE): a multi-view action dataset with multiple modalities and view-points supplemented with hierarchical activity and atomic action labels together with dense scene composition labels. Leveraging rich multi-modal and multi-view settings, we propose Cooperative Compositional Action Understanding (CCAU), a cooperative learning framework for hierarchical action recognition that is aware of compositional action elements. CCAU shows consistent performance improvements across all modalities. Furthermore, we demonstrate the utility of co-learning compositions in few-shot action recognition by achieving 28.6% mAP with just a single sample.
CVDec 15, 2019
Action Genome: Actions as Composition of Spatio-temporal Scene GraphsJingwei Ji, Ranjay Krishna, Li Fei-Fei et al.
Action recognition has typically treated actions and activities as monolithic events that occur in videos. However, there is evidence from Cognitive Science and Neuroscience that people actively encode activities into consistent hierarchical part structures. However in Computer Vision, few explorations on representations encoding event partonomies have been made. Inspired by evidence that the prototypical unit of an event is an action-object interaction, we introduce Action Genome, a representation that decomposes actions into spatio-temporal scene graphs. Action Genome captures changes between objects and their pairwise relationships while an action occurs. It contains 10K videos with 0.4M objects and 1.7M visual relationships annotated. With Action Genome, we extend an existing action recognition model by incorporating scene graphs as spatio-temporal feature banks to achieve better performance on the Charades dataset. Next, by decomposing and learning the temporal changes in visual relationships that result in an action, we demonstrate the utility of a hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples. Finally, we benchmark existing scene graph models on the new task of spatio-temporal scene graph prediction.
CVOct 3, 2019
Learning Temporal Action Proposals With Fewer LabelsJingwei Ji, Kaidi Cao, Juan Carlos Niebles
Temporal action proposals are a common module in action detection pipelines today. Most current methods for training action proposal modules rely on fully supervised approaches that require large amounts of annotated temporal action intervals in long video sequences. The large cost and effort in annotation that this entails motivate us to study the problem of training proposal modules with less supervision. In this work, we propose a semi-supervised learning algorithm specifically designed for training temporal action proposal networks. When only a small number of labels are available, our semi-supervised method generates significantly better proposals than the fully-supervised counterpart and other strong semi-supervised baselines. We validate our method on two challenging action detection video datasets, ActivityNet v1.3 and THUMOS14. We show that our semi-supervised approach consistently matches or outperforms the fully supervised state-of-the-art approaches.
CVJun 27, 2019
Few-Shot Video Classification via Temporal AlignmentKaidi Cao, Jingwei Ji, Zhangjie Cao et al.
There is a growing interest in learning a model which could recognize novel classes with only a few labeled examples. In this paper, we propose Temporal Alignment Module (TAM), a novel few-shot learning framework that can learn to classify a previous unseen video. While most previous works neglect long-term temporal ordering information, our proposed model explicitly leverages the temporal ordering information in video data through temporal alignment. This leads to strong data-efficiency for few-shot learning. In concrete, TAM calculates the distance value of query video with respect to novel class proxies by averaging the per frame distances along its alignment path. We introduce continuous relaxation to TAM so the model can be learned in an end-to-end fashion to directly optimize the few-shot learning objective. We evaluate TAM on two challenging real-world datasets, Kinetics and Something-Something-V2, and show that our model leads to significant improvement of few-shot video classification over a wide range of competitive baselines.
CVAug 11, 2017
DeformNet: Free-Form Deformation Network for 3D Shape Reconstruction from a Single ImageAndrey Kurenkov, Jingwei Ji, Animesh Garg et al.
3D reconstruction from a single image is a key problem in multiple applications ranging from robotic manipulation to augmented reality. Prior methods have tackled this problem through generative models which predict 3D reconstructions as voxels or point clouds. However, these methods can be computationally expensive and miss fine details. We introduce a new differentiable layer for 3D data deformation and use it in DeformNet to learn a model for 3D reconstruction-through-deformation. DeformNet takes an image input, searches the nearest shape template from a database, and deforms the template to match the query image. We evaluate our approach on the ShapeNet dataset and show that - (a) the Free-Form Deformation layer is a powerful new building block for Deep Learning models that manipulate 3D data (b) DeformNet uses this FFD layer combined with shape retrieval for smooth and detail-preserving 3D reconstruction of qualitatively plausible point clouds with respect to a single query image (c) compared to other state-of-the-art 3D reconstruction methods, DeformNet quantitatively matches or outperforms their benchmarks by significant margins. For more information, visit: https://deformnet-site.github.io/DeformNet-website/ .