CV · Jun 29, 2022 · Code
SRCN3D: Sparse R-CNN 3D for Compact Convolutional Multi-View 3D Object Detection and Tracking
Yining Shi, Jingyan Shen, Yifan Sun et al. · tsinghua
Detection and tracking of moving objects is an essential component of environmental perception for autonomous driving. In the flourishing field of multi-view camera-based 3D detectors, different transformer-based pipelines are designed to learn queries in 3D space from 2D feature maps of perspective views, but the dominant dense BEV query mechanism is computationally inefficient. This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage fully-sparse detector that incorporates sparse queries, sparse attention with box-wise sampling, and sparse prediction. SRCN3D adopts a cascade structure with a twin-track update of both a fixed number of query boxes and latent query features. Our novel sparse feature sampling module uses only local 2D region of interest (RoI) features, computed by projecting the 3D query boxes, for further box refinement, leading to a fully-convolutional and deployment-friendly pipeline. For multi-object tracking, motion features, query features and RoI features are jointly used in multi-hypothesis data association. Extensive experiments on the nuScenes dataset demonstrate that SRCN3D achieves competitive performance in both 3D object detection and multi-object tracking tasks, while also exhibiting superior efficiency compared to transformer-based methods. Code and models are available at https://github.com/synsin0/SRCN3D.
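The abstract's "projection of 3D query boxes" to 2D RoIs can be illustrated with a minimal sketch. It assumes an ideal pinhole camera and an axis-aligned box already expressed in the camera frame; the function names and the intrinsics `fx`, `fy`, `u0`, `v0` are hypothetical and are not taken from the SRCN3D code.

```python
from itertools import product

def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box (camera frame, metres)."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    return [(cx + sx * hx, cy + sy * hy, cz + sz * hz)
            for sx, sy, sz in product((-1, 1), repeat=3)]

def project_to_roi(center, size, fx, fy, u0, v0):
    """Project the box corners with a pinhole model and return the enclosing
    2D rectangle (u_min, v_min, u_max, v_max), or None if no corner is in
    front of the camera."""
    us, vs = [], []
    for x, y, z in box_corners(center, size):
        if z <= 0:
            continue  # corner lies behind the image plane; skip it
        us.append(fx * x / z + u0)
        vs.append(fy * y / z + v0)
    if not us:
        return None
    return (min(us), min(vs), max(us), max(vs))

roi = project_to_roi(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0),
                     fx=100.0, fy=100.0, u0=0.0, v0=0.0)
print(roi)
```

A real detector would also handle box yaw, per-camera extrinsics, and clipping to the image bounds; those are omitted here to keep the geometry visible.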
CV · Aug 25, 2022
Bridging the View Disparity Between Radar and Camera Features for Multi-modal Fusion 3D Object Detection
Taohua Zhou, Yining Shi, Junjie Chen et al. · tsinghua
Environmental perception with the multi-modal fusion of radar and camera is crucial in autonomous driving to increase accuracy, completeness, and robustness. This paper focuses on fusing millimeter-wave (MMW) radar and camera sensors for 3D object detection. A novel method is proposed that performs feature-level fusion in the bird's-eye view (BEV) to obtain a better feature representation. First, radar points are augmented with temporal accumulation and sent to a spatial-temporal encoder for radar feature extraction. Meanwhile, multi-scale 2D image features, which adapt to various spatial scales, are obtained by the image backbone and neck. Then, image features are transformed to BEV with the designed view transformer. In addition, this work fuses the multi-modal features with a two-stage fusion model consisting of point-fusion and ROI-fusion. Finally, a detection head regresses object categories and 3D locations. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on the most crucial detection metrics, mean average precision (mAP) and nuScenes detection score (NDS), on the challenging nuScenes dataset.
CV · Mar 2, 2023
Grid-Centric Traffic Scenario Perception for Autonomous Driving: A Comprehensive Review
Yining Shi, Kun Jiang, Jiusi Li et al. · tsinghua
Grid-centric perception is a crucial field for mobile robot perception and navigation. Nonetheless, grid-centric perception is less prevalent than object-centric perception, because autonomous vehicles need to accurately perceive highly dynamic, large-scale traffic scenarios and the complexity and computational costs of grid-centric perception are high. In recent years, the rapid development of deep learning techniques and hardware has provided fresh insights into the evolution of grid-centric perception. The fundamental difference between grid-centric and object-centric pipelines is that grid-centric perception follows a geometry-first paradigm, which is more robust to open-world driving scenarios with endless long-tailed, semantically unknown obstacles. Recent research demonstrates the great advantages of grid-centric perception, such as comprehensive fine-grained environmental representation, greater robustness to occlusion and irregularly shaped objects, better ground estimation, and safer planning policies. There is also a growing trend of expanding the capacity of occupancy networks to 4D scene perception and prediction, and the latest techniques are closely related to new research topics such as 4D occupancy forecasting, generative AI and world models in the field of autonomous driving. Given the lack of current surveys for this rapidly expanding field, we present a hierarchically structured review of grid-centric perception for autonomous vehicles. We organize previous and current knowledge of occupancy grid techniques along the main vein from 2D BEV grids to 3D occupancy to 4D occupancy forecasting. We additionally summarize label-efficient occupancy learning and the role of grid-centric perception in driving systems. Lastly, we summarize current research trends and provide future outlooks.
74.4 · CV · Mar 31 · Code
SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
Wenchao Sun, Xuewu Lin, Keyu Chen et al. · tsinghua
End-to-end multi-modal planning has been widely adopted to model the uncertainty of driving behavior, typically by scoring candidate trajectories and selecting the optimal one. Existing approaches generally fall into two categories: scoring a large static trajectory vocabulary, or scoring a small set of dynamically generated proposals. While static vocabularies often suffer from coarse discretization of the action space, dynamic proposals provide finer-grained precision and have shown stronger empirical performance on existing benchmarks. However, it remains unclear whether dynamic generation is fundamentally necessary, or whether static vocabularies can already achieve comparable performance when they are sufficiently dense to cover the action space. In this work, we start with a systematic scaling study of Hydra-MDP, a representative scoring-based method, revealing that performance consistently improves as trajectory anchors become denser, without exhibiting saturation before computational constraints are reached. Motivated by this observation, we propose SparseDriveV2 to push the performance boundary of scoring-based planning through two complementary innovations: (1) a scalable vocabulary representation with a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space, and (2) a scalable scoring strategy with coarse factorized scoring over paths and velocity profiles followed by fine-grained scoring on a small set of composed trajectories. By combining these two techniques, SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 as backbone. Code and model are released at https://github.com/swc-17/SparseDriveV2.
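The factorized vocabulary and coarse-to-fine scoring described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the SparseDriveV2 implementation: paths and velocity profiles are stand-in integers, and the scoring functions and the `k_paths`/`k_speeds` parameters are hypothetical.

```python
from itertools import product

def coarse_to_fine(paths, speeds, score_path, score_speed, score_traj,
                   k_paths, k_speeds):
    """Score paths and velocity profiles separately (coarse stage), then score
    only the trajectories composed from the top-k of each factor (fine stage)."""
    top_paths = sorted(paths, key=score_path, reverse=True)[:k_paths]
    top_speeds = sorted(speeds, key=score_speed, reverse=True)[:k_speeds]
    candidates = list(product(top_paths, top_speeds))  # composed trajectories
    return max(candidates, key=lambda c: score_traj(*c))

# Toy vocabulary: 5 paths x 5 velocity profiles = 25 implicit trajectories.
paths = [-2, -1, 0, 1, 2]
speeds = [0, 2, 4, 6, 8]

best = coarse_to_fine(
    paths, speeds,
    score_path=lambda p: -abs(p - 1),                    # hypothetical coarse path score
    score_speed=lambda v: -abs(v - 5),                   # hypothetical coarse speed score
    score_traj=lambda p, v: -(abs(p - 1) + abs(v - 6)),  # hypothetical fine score
    k_paths=2, k_speeds=2,
)
print(best)  # (1, 6)
```

In this toy setup the coarse stage evaluates 5 + 5 factors and the fine stage only 2 x 2 composed trajectories, instead of scoring all 25 combinations directly.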
CV · Feb 19, 2023
StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation
Yining Shi, Kun Jiang, Ke Wang et al. · tsinghua
Predicting the future occupancy states of the surrounding environment is a vital task for autonomous driving. However, current best-performing single-modality methods or multi-modality fusion perception methods are only able to predict uniform snapshots of future occupancy states and require strictly synchronized sensory data for sensor fusion. We propose a novel framework, StreamingFlow, to lift these strong limitations. StreamingFlow is a novel BEV occupancy predictor that ingests asynchronous multi-sensor data streams for fusion and performs streaming forecasting of the future occupancy map at any future timestamps. By integrating neural ordinary differential equations (N-ODE) into recurrent neural networks, StreamingFlow learns derivatives of BEV features over temporal horizons, updates the implicit sensor's BEV features as part of the fusion process, and propagates BEV states to the desired future time point. It shows good zero-shot generalization ability of prediction, reflected in the interpolation of the observed prediction time horizon and the reasonable inference of the unseen farther future period. Extensive experiments on two large-scale datasets, nuScenes and Lyft L5, demonstrate that StreamingFlow significantly outperforms previous vision-based, LiDAR-based methods, and shows superior performance compared to state-of-the-art fusion-based methods.
LG · Oct 18, 2023
From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers
Shaoxiong Duan, Yining Shi, Wei Xu
In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. In particular, our solution solves the Parity task, a well-known and theoretically proven failure mode for Transformers. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we show to be connected to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks. In addition, we show that ABC bears remarkable similarities to RPE and LoRA, which may indicate the potential for applications to more complex tasks.
RO · May 21, 2025 · Code
AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Kangan Qian, Sicong Jiang, Yang Zhong et al. · tsinghua
Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: (i) Structured Data Generation, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models. Code is available at https://github.com/curryqka/AgentThink.
61.7 · RO · May 18
4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving
Kane Qian, Xin Zhao, Yining Shi et al.
We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types, including rotating, solid-state, and blind-spot variants, surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, birds-eye view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometric-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.
CV · Jun 16, 2025 · Code
COME: Adding Scene-Centric Forecasting Control to Occupancy World Model
Yining Shi, Kun Jiang, Qiang Meng et al. · tsinghua
World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolvement (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego-motion by leveraging the scene-centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene-centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego-irrelevant, spatially consistent future features through a scene-centric prediction branch, which are then converted into scene condition using a tailored ControlNet. These condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the nuScenes-Occ3D dataset show that COME achieves consistent and significant improvements over state-of-the-art (SOTA) methods across diverse configurations, including different input sources (ground-truth, camera-based, fusion-based occupancy) and prediction horizons (3s and 8s). For example, under the same settings, COME achieves 26.3% better mIoU metric than DOME and 23.7% better mIoU metric than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio-temporal prediction fidelity for world models. Code and videos will be available at https://github.com/synsin0/COME.
LG · Apr 24, 2025
TileLang: A Composable Tiled Programming Model for AI Systems
Lei Wang, Yu Cheng, Yining Shi et al.
Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI kernel programming. TileLang decouples the scheduling space (thread binding, layout, tensorize and pipeline) from dataflow and encapsulates it as a set of customization annotations and primitives. This approach allows users to focus on the kernel's data-flow itself, while leaving most other optimizations to compilers. We conduct comprehensive experiments on commonly used devices; our evaluation shows that TileLang can achieve state-of-the-art performance in key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.
RO · Mar 11, 2025
FASIONAD++ : Integrating High-Level Instruction and Information Bottleneck in FAt-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback
Kangan Qian, Ziang Luo, Sicong Jiang et al. · tsinghua
Ensuring safe, comfortable, and efficient planning is crucial for autonomous driving systems. While end-to-end models trained on large datasets perform well in standard driving scenarios, they struggle with complex low-frequency events. Recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs) offer enhanced reasoning but suffer from computational inefficiency. Inspired by the dual-process cognitive model "Thinking, Fast and Slow", we propose FASIONAD, a novel dual-system framework that synergizes a fast end-to-end planner with a VLM-based reasoning module. The fast system leverages end-to-end learning to achieve real-time trajectory generation in common scenarios, while the slow system activates through uncertainty estimation to perform contextual analysis and complex scenario resolution. Our architecture introduces three key innovations: (1) A dynamic switching mechanism enabling slow system intervention based on real-time uncertainty assessment; (2) An information bottleneck with high-level plan feedback that optimizes the slow system's guidance capability; (3) A bidirectional knowledge exchange where visual prompts enhance the slow system's reasoning while its feedback refines the fast planner's decision-making. To strengthen VLM reasoning, we develop a question-answering mechanism coupled with a reward-instruct training strategy. In open-loop experiments, FASIONAD achieves a 6.7% reduction in average L2 trajectory error and a 28.1% lower collision rate.
CV · Mar 10, 2025
LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction
Kangan Qian, Jinyu Miao, Ziang Luo et al. · tsinghua
Accurate and reliable spatial and motion information plays a pivotal role in autonomous driving systems. However, object-level perception models struggle with handling open scenario categories and lack precise intrinsic geometry. On the other hand, occupancy-based class-agnostic methods excel in representing scenes but fail to ensure physics consistency and ignore the importance of interactions between traffic participants, hindering the model's ability to learn accurate and reliable motion. In this paper, we introduce a novel occupancy-instance modeling framework for class-agnostic motion prediction tasks, named LEGO-Motion, which incorporates instance features into Bird's Eye View (BEV) space. Our model comprises (1) a BEV encoder, (2) an Interaction-Augmented Instance Encoder, and (3) an Instance-Enhanced BEV Encoder, improving both interaction relationships and physics consistency within the model, thereby ensuring a more accurate and robust understanding of the environment. Extensive experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches. Furthermore, the effectiveness of our framework is validated on the advanced FMCW LiDAR benchmark, showcasing its practical applicability and generalization capabilities. The code will be made publicly available to facilitate further research.
RO · Nov 27, 2024
FASIONAD : FAst and Slow FusION Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback
Kangan Qian, Zhikun Ma, Yangfan He et al. · tsinghua
Ensuring safe, comfortable, and efficient navigation is a critical goal for autonomous driving systems. While end-to-end models trained on large-scale datasets excel in common driving scenarios, they often struggle with rare, long-tail events. Recent progress in large language models (LLMs) has introduced enhanced reasoning capabilities, but their computational demands pose challenges for real-time decision-making and precise planning. This paper presents FASIONAD, a novel dual-system framework inspired by the cognitive model "Thinking, Fast and Slow." The fast system handles routine navigation tasks using rapid, data-driven path planning, while the slow system focuses on complex reasoning and decision-making in challenging or unfamiliar situations. A dynamic switching mechanism based on score distribution and feedback allows seamless transitions between the two systems. Visual prompts generated by the fast system enable human-like reasoning in the slow system, which provides high-quality feedback to enhance the fast system's decision-making. To evaluate FASIONAD, we introduce a new benchmark derived from the nuScenes dataset, specifically designed to differentiate fast and slow scenarios. FASIONAD achieves state-of-the-art performance on this benchmark, establishing a new standard for frameworks integrating fast and slow cognitive processes in autonomous driving. This approach paves the way for more adaptive, human-like autonomous driving systems.
CV · Dec 5, 2024
How Cars Move: Analyzing Driving Dynamics for Safer Urban Traffic
Kangan Qian, Jinyu Miao, Xinyu Jiao et al. · tsinghua
Understanding the spatial dynamics of cars within urban systems is essential for optimizing infrastructure management and resource allocation. Recent empirical approaches for analyzing traffic patterns have gained traction due to their applicability to city-scale policy development. However, conventional methodologies often rely on fragmented grid-based techniques, which may overlook critical interdependencies among spatial elements and temporal continuity. These limitations can compromise analytical effectiveness in complex urban environments. To address these challenges, we propose PriorMotion, a data integration framework designed to systematically uncover movement patterns through driving dynamics analysis. Our approach combines multi-scale empirical observations with customized analytical tools to capture evolving spatial-temporal trends in urban traffic. Comprehensive evaluations demonstrate that PriorMotion significantly enhances analytical outcomes, including increased accuracy in traffic pattern analysis, improved adaptability to heterogeneous data environments, and reduced long-term projection errors. Validation confirms its effectiveness for urban infrastructure management applications requiring precise characterization of complex spatial-temporal interactions.
CV · Apr 14, 2025
CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates
Ankit Kumar Shaw, Kun Jiang, Tuopu Wen et al. · tsinghua
The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at https://Ankit-Zefan.github.io/CleanMap/
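The confidence-driven selection step (top-k local maps within "best score minus 10%") might look like the sketch below. It assumes the 10% margin is relative to the best score, which the abstract does not specify; `select_local_maps` and the sample scores are hypothetical.

```python
def select_local_maps(scored_maps, k=3, margin=0.10):
    """Keep maps whose confidence is within `margin` of the best score
    (interpreted here as score >= best * (1 - margin)), then return the
    k highest-scoring ones as (map_id, score) pairs."""
    if not scored_maps:
        return []
    best = max(score for _, score in scored_maps)
    threshold = best * (1.0 - margin)
    eligible = [(m, s) for m, s in scored_maps if s >= threshold]
    eligible.sort(key=lambda ms: ms[1], reverse=True)
    return eligible[:k]

# Hypothetical confidence scores on the abstract's 0-10 scale.
maps = [("a", 9.0), ("b", 8.5), ("c", 7.9), ("d", 8.2), ("e", 8.8)]
selected = select_local_maps(maps, k=3)
print([name for name, _ in selected])  # ['a', 'e', 'b']
```

If the margin were instead meant as an absolute offset on the 0-10 scale, only the `threshold` line would change.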
CV · Apr 8, 2025
POD: Predictive Object Detection with Single-Frame FMCW LiDAR Point Cloud
Yining Shi, Kun Jiang, Xin Zhao et al. · tsinghua
LiDAR-based 3D object detection is a fundamental task in the field of autonomous driving. This paper explores the unique advantage of Frequency Modulated Continuous Wave (FMCW) LiDAR in autonomous perception. Given a single frame FMCW point cloud with radial velocity measurements, we expect that our object detector can detect the short-term future locations of objects using only the current frame sensor data and demonstrate a fast ability to respond to intermediate danger. To achieve this, we extend the standard object detection task to a novel task named predictive object detection (POD), which aims to predict the short-term future location and dimensions of objects based solely on current observations. Typically, a motion prediction task requires historical sensor information to process the temporal contexts of each object, while our detector's avoidance of multi-frame historical information enables a much faster response time to potential dangers. The core advantage of FMCW LiDAR lies in the radial velocity associated with every reflected point. We propose a novel POD framework, the core idea of which is to generate a virtual future point using a ray casting mechanism, create virtual two-frame point clouds with the current and virtual future frames, and encode these two-frame voxel features with a sparse 4D encoder. Subsequently, the 4D voxel features are separated by temporal indices and remapped into two Bird's Eye View (BEV) features: one decoded for standard current frame object detection and the other for future predictive object detection. Extensive experiments on our in-house dataset demonstrate the state-of-the-art standard and predictive detection performance of the proposed POD framework.
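One simple reading of the "virtual future point" idea is to slide each point along its sensor ray by the measured radial velocity times a short horizon. The sketch below assumes the sensor sits at the origin and that positive radial velocity means motion away from the sensor; it is an illustration only, and the paper's actual ray casting mechanism may differ.

```python
import math

def virtual_future_point(point, radial_velocity, dt):
    """Move a point along the ray from the sensor origin by radial_velocity * dt.
    point: (x, y, z) in metres; radial_velocity in m/s; dt in seconds."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return point  # ray direction is undefined at the sensor origin
    scale = (r + radial_velocity * dt) / r
    return (x * scale, y * scale, z * scale)

# A point 5 m away, receding at 2 m/s, predicted 0.5 s ahead: range grows to 6 m.
future = virtual_future_point((3.0, 4.0, 0.0), radial_velocity=2.0, dt=0.5)
print(future)
```

Because only the radial component is measured, tangential motion is invisible to this step, which is one reason a learned encoder over the current and virtual frames is still needed.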
CV · Jun 11, 2024
EFFOcc: Learning Efficient Occupancy Networks from Minimal Labels for Autonomous Driving
Yining Shi, Kun Jiang, Jinyu Miao et al.
3D occupancy prediction (3DOcc) is a rapidly rising and challenging perception task in the field of autonomous driving. Existing 3D occupancy networks (OccNets) are both computationally heavy and label-hungry. In terms of model complexity, OccNets are commonly composed of heavy Conv3D modules or transformers at the voxel level. Moreover, OccNets are supervised with expensive large-scale dense voxel labels. Model and data inefficiencies, caused by excessive network parameters and label annotation requirements, severely hinder the onboard deployment of OccNets. This paper proposes an EFFicient Occupancy learning framework, EFFOcc, that targets minimal network complexity and label requirements while achieving state-of-the-art accuracy. We first propose an efficient fusion-based OccNet that only uses simple 2D operators and improves accuracy to the state-of-the-art on three large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, the fusion-based model with ResNet-18 as the image backbone has 21.35M parameters and achieves 51.49 in terms of mean Intersection over Union (mIoU). Furthermore, we propose a multi-stage occupancy-oriented distillation to efficiently transfer knowledge to vision-only OccNet. Extensive experiments on occupancy benchmarks show state-of-the-art precision for both fusion-based and vision-based OccNets. For the demonstration of learning with limited labels, we achieve 94.38% of the performance (mIoU = 28.38) of a 100% labeled vision OccNet (mIoU = 30.07) using the same OccNet trained with only 40% labeled sequences and distillation from the fusion-based OccNet.
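The limited-label figure follows directly from the two mIoU values quoted in the abstract, as this short check shows (variable names are illustrative):

```python
miou_40_percent_labels = 28.38   # vision OccNet, 40% labeled sequences + distillation
miou_full_labels = 30.07         # same OccNet trained with 100% labels

retained = miou_40_percent_labels / miou_full_labels
print(f"{retained:.2%}")  # 94.38%
```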
CV · Jun 11, 2024
PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving
Yining Shi, Jiusi Li, Kun Jiang et al.
Vision-centric occupancy networks, which represent the surrounding environment with uniform voxels with semantics, have become a new trend for safe driving of camera-only autonomous driving perception systems, as they are able to detect obstacles regardless of their shape and occlusion. Modern occupancy networks mainly focus on reconstructing visible voxels from object surfaces with voxel-wise semantic prediction. They usually suffer from inconsistent predictions within one object and mixed predictions for adjacent objects. These confusions may harm the safety of downstream planning modules. To this end, we investigate panoptic segmentation on 3D voxel scenarios and propose an instance-aware occupancy network, PanoSSC. We predict foreground objects and backgrounds separately and merge both in post-processing. For foreground instance grouping, we propose a novel 3D instance mask decoder that can efficiently extract individual objects. We unify geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the PanoSSC framework and propose new metrics for evaluating panoptic voxels. Extensive experiments show that our method achieves competitive results on the SemanticKITTI semantic scene completion benchmark.
SD · Dec 17, 2021
Soundify: Matching Sound Effects to Video
David Chuan-En Lin, Anastasis Germanidis, Cristóbal Valenzuela et al.
In the art of video editing, sound helps add character to an object and immerse the viewer within a space. Through formative interviews with professional editors (N=10), we found that the task of adding sounds to video can be challenging. This paper presents Soundify, a system that assists editors in matching sounds to video. Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio. In a human evaluation study (N=889), we show that Soundify is capable of matching sounds to video out-of-the-box for a diverse range of audio categories. In a within-subjects expert study (N=12), we demonstrate the usefulness of Soundify in helping video editors match sounds to video with lighter workload, reduced task completion time, and improved usability.
LG · Dec 14, 2021
HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework
Xupeng Miao, Hailin Zhang, Yining Shi et al.
Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often result in large parameter space. We observe that existing distributed training frameworks face a scalability issue of embedding models since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace skewed popularity distributions of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into HET design, which provides fine-grained consistency guarantees on a per-embedding basis. Compared to previous work that only allows staleness for read operations, HET also utilizes staleness for write operations. Evaluations on six representative tasks show that HET achieves up to 88% embedding communication reductions and up to 20.68x performance speedup over the state-of-the-art baselines.
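To make the idea of write staleness concrete, here is a toy worker-side embedding cache that buffers updates per embedding and pushes them to the server only once a bound is exceeded. It is a sketch under stated assumptions: the class, its counters, and the dict standing in for the parameter server are hypothetical, read-side staleness is not modeled, and HET's actual consistency protocol is not reproduced.

```python
class StaleEmbeddingCache:
    """Toy per-embedding cache: reads hit the local copy, and writes are
    buffered until more than `staleness_bound` updates are pending."""

    def __init__(self, server, staleness_bound):
        self.server = server          # dict key -> float, stand-in for a parameter server
        self.bound = staleness_bound
        self.values = {}              # locally cached embedding values
        self.pending = {}             # key -> (accumulated delta, update count)
        self.server_fetches = 0
        self.server_pushes = 0

    def read(self, key):
        if key not in self.values:    # cache miss: fetch from the server once
            self.values[key] = self.server[key]
            self.server_fetches += 1
        return self.values[key]

    def write(self, key, delta):
        self.read(key)                # make sure the embedding is cached
        self.values[key] += delta     # apply the update locally right away
        acc, count = self.pending.get(key, (0.0, 0))
        acc, count = acc + delta, count + 1
        if count > self.bound:        # too stale: flush the buffered updates
            self.server[key] += acc
            self.server_pushes += 1
            self.values[key] = self.server[key]   # resynchronise with the server
            self.pending.pop(key, None)
        else:
            self.pending[key] = (acc, count)

server = {"user_42": 1.0}
cache = StaleEmbeddingCache(server, staleness_bound=2)
for _ in range(3):
    cache.write("user_42", 0.5)
print(server["user_42"], cache.server_fetches, cache.server_pushes)  # 2.5 1 1
```

Three updates to a popular embedding cost one fetch and one push here, which is the kind of communication saving a skewed popularity distribution makes possible.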