RO May 6
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
Huimin Wang, Yue Wang, Bihao Cui et al.
We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision--draft--reflect rollout with reinforcement learning (RL), assigning a terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most $0.3$, whereas RL increases its gain to $1.9$. We also co-design an efficient reflective decoding stack for the decision--draft--reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves $91.0$ PDMS with camera-only input and $94.8$ PDMS in a best-of-6 oracle setting, while running at $31.8$ ms average latency on NVIDIA Thor.
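The draft-then-edit idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_model`, the confidence-ordered unmasking schedule, and the choice of edit positions are all assumptions made for illustration, with a random stand-in where the real planner would predict trajectory tokens.

```python
import random

MASK = -1  # sentinel for a masked trajectory token (assumed encoding)

def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for the planner: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

def masked_decode(tokens, vocab_size, steps, rng):
    """Fill every MASK position within `steps` parallel rounds."""
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_model(tokens, vocab_size, rng)
        # Commit an even share of the remaining masked positions per round,
        # most confident first (ceil division so the last round finishes).
        k = -(-len(masked) // (steps - step))
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    return tokens

def auto_edit(tokens, edit_positions, vocab_size, steps, rng):
    """Re-mask the selected positions and decode them again with the same model."""
    tokens = list(tokens)
    for i in edit_positions:
        tokens[i] = MASK
    return masked_decode(tokens, vocab_size, steps, rng)

rng = random.Random(0)
draft = masked_decode([MASK] * 8, vocab_size=16, steps=4, rng=rng)
edited = auto_edit(draft, edit_positions=[2, 5], vocab_size=16, steps=2, rng=rng)
```

Only the re-masked positions can change during the edit; every other token of the draft is carried over unchanged, which is what makes the revision in-place.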
CV Mar 13, 2023
OSIS: Efficient One-stage Network for 3D Instance Segmentation
Chuan Tang, Xi Yang
Current 3D instance segmentation models generally use multi-stage methods to extract instance objects, including clustering, feature extraction, and post-processing. However, these multi-stage approaches rely on hyperparameter settings and hand-crafted processes, which restrict the inference speed of the model. In this paper, we propose a new 3D point cloud instance segmentation network, named OSIS. OSIS is a one-stage network that directly segments instances from 3D point cloud data using a neural network. To segment instances directly from the network, we propose an instance decoder, which decodes instance features from the network into instance segments. OSIS is trained end-to-end with bipartite matching, so it does not require computationally expensive post-processing steps such as non-maximum suppression (NMS) and clustering during inference. Experiments show that our network achieves strong performance on a commonly used indoor scene instance segmentation dataset, with an average inference time of only 138 ms per scene, substantially faster than the previous fastest method.
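As a rough illustration of bipartite matching between predicted and ground-truth instances, the sketch below assigns each ground-truth instance to a distinct prediction so that the total cost is minimal. The cost `1 - IoU` over point-index sets and the brute-force search are assumptions chosen for readability, not the paper's matching cost; a practical implementation would typically use the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def mask_iou(a, b):
    """Intersection over union of two instance masks given as point-index lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_instances(pred_masks, gt_masks):
    """Return (pred_idx, gt_idx) pairs minimizing the summed cost 1 - IoU.

    Brute force over assignments; assumes len(pred_masks) >= len(gt_masks)
    and only suits tiny examples.
    """
    cost = [[1.0 - mask_iou(p, g) for g in gt_masks] for p in pred_masks]
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(pred_masks)), len(gt_masks)):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return [(p, g) for g, p in enumerate(best)]

pred = [[0, 1, 2], [7, 8], [3, 4, 5]]   # three predicted instances
gt = [[3, 4, 5, 6], [0, 1]]             # two ground-truth instances
matches = match_instances(pred, gt)     # [(2, 0), (0, 1)]
```

Because each ground-truth instance is paired with exactly one prediction during training, duplicate predictions are penalized by the loss itself, which is why a matching-based objective can remove the need for NMS at inference.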
CV Jul 5, 2021
Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching between Parts and Words
Chuan Tang, Xi Yang, Bojian Wu et al.
Shape-text matching is an important task in high-level shape understanding. Current methods mainly represent a 3D shape as multiple 2D rendered views, which are hard to interpret because of the structural ambiguity caused by self-occlusion in a limited number of views. To resolve this issue, we directly represent 3D shapes as point clouds, and propose to learn a joint embedding of point clouds and texts by bidirectional matching between parts from shapes and words from texts. Specifically, we first segment the point clouds into parts, and then use optimal transport to match parts and words in an optimized feature space, where each part is represented by aggregating the features of all points within it and each word is abstracted by its contextual information. We optimize the feature space to increase the similarity between paired training samples while simultaneously maximizing the margin between unpaired ones. Experiments demonstrate that our method achieves a significant improvement in accuracy over state-of-the-art methods on multi-modal retrieval tasks on the Text2Shape dataset. Code is available at https://github.com/JLUtangchuan/Parts2Words.
CV Sep 5, 2023
AI Mobile Application for Archaeological Dating of Bronze Dings
Chuntao Li, Ruihua Qi, Chuan Tang et al.
We develop an AI application for archaeological dating of bronze Dings. A classification model predicts the period of the input Ding, and a detection model highlights the feature parts that support the dating decision. To train the two deep learning models, we collected a large number of Ding images from published materials, and archaeological experts annotated the period and the feature parts on each image. Furthermore, we design a user system and deploy our pre-trained models on the WeChat Mini Program platform for ease of use. With only a smartphone that has the WeChat app installed, users can take a photo of a bronze Ding and easily obtain the dating result, the feature parts, and other reference artifacts. To use our application, please scan this QR code with WeChat.