Jianru Xue

CV
h-index17
36papers
3,210citations
Novelty48%
AI Score53

36 Papers

CVDec 19, 2022Code
Cognitive Accident Prediction in Driving Scenes: A Multimodality Benchmark

Jianwu Fang, Lei-Lei Li, Kuan Yang et al.

Traffic accident prediction in driving videos aims to provide an early warning of the accident occurrence, and supports the decision making of safe driving systems. Previous works usually concentrate on the spatial-temporal correlation of object-level context, while they do not fit the inherent long-tailed data distribution well and are vulnerable to severe environmental change. In this work, we propose a Cognitive Accident Prediction (CAP) method that explicitly leverages human-inspired cognition of text description on the visual observation and the driver attention to facilitate model training. In particular, the text description provides a dense semantic description guidance for the primary context of the traffic scene, while the driver attention provides a traction to focus on the critical region closely correlating with safe driving. CAP is formulated by an attentive text-to-vision shift fusion module, an attentive scene context transfer module, and the driver attention guided accident prediction module. We leverage the attention mechanism in these modules to explore the core semantic cues for accident prediction. In order to train CAP, we extend an existing self-collected DADA-2000 dataset (with annotated driver attention for each frame) with further factual text descriptions for the visual observations before the accidents. Besides, we construct a new large-scale benchmark consisting of 11,727 in-the-wild accident videos with over 2.19 million frames (named as CAP-DATA) together with labeled fact-effect-reason-introspection description and temporal accident frame label. Based on extensive experiments, the superiority of CAP is validated compared with state-of-the-art approaches. The code, CAP-DATA, and all results will be released in \url{https://github.com/JWFanggit/LOTVS-CAP}.

ROApr 7, 2023Code
UAV Obstacle Avoidance by Human-in-the-Loop Reinforcement in Arbitrary 3D Environment

Xuyang Li, Jianwu Fang, Kai Du et al.

This paper focuses on the continuous control of the unmanned aerial vehicle (UAV) based on a deep reinforcement learning method for a large-scale 3D complex environment. The purpose is to make the UAV reach any target point from a certain starting point, and the flying height and speed are variable during navigation. In this work, we propose a deep reinforcement learning (DRL)-based method combined with human-in-the-loop, which allows the UAV to avoid obstacles automatically during flying. We design multiple reward functions based on the relevant domain knowledge to guide UAV navigation. The role of human-in-the-loop is to dynamically change the reward function of the UAV in different situations to suit the obstacle avoidance of the UAV better. We verify the success rate and average step size on urban, rural, and forest scenarios, and the experimental results show that the proposed method can reduce the training convergence time and improve the efficiency and accuracy of navigation tasks. The code is available on the website https://github.com/Monnalo/UAV_navigation.

CVNov 1, 2022
Behavioral Intention Prediction in Driving Scenes: A Survey

Jianwu Fang, Fan Wang, Jianru Xue et al.

In the driving scene, the road agents usually conduct frequent interactions and intention understanding of the surroundings. Ego-agent (each road agent itself) predicts what behavior will be engaged by other road users all the time and expects a shared and consistent understanding for safe movement. Behavioral Intention Prediction (BIP) simulates such a human consideration process and fulfills the early prediction of specific behaviors. Similar to other prediction tasks, such as trajectory prediction, data-driven deep learning methods have taken the primary pipeline in research. The rapid development of BIP inevitably leads to new issues and challenges. To catalyze future research, this work provides a comprehensive review of BIP from the available datasets, key factors and challenges, pedestrian-centric and vehicle-centric BIP approaches, and BIP-aware applications. Based on the investigation, data-driven deep learning approaches have become the primary pipelines. The behavioral intention types are still monotonous in most current datasets and methods (e.g., Crossing (C) and Not Crossing (NC) for pedestrians and Lane Changing (LC) for vehicles) in this field. In addition, for the safe-critical scenarios (e.g., near-crashing situations), current research is limited. Through this investigation, we identify open issues in behavioral intention prediction and suggest possible insights for future research.

CVAug 1, 2023Code
Gated Driver Attention Predictor

Tianci Zhao, Xue Bai, Jianwu Fang et al.

Driver attention prediction implies the intention understanding of where the driver intends to go and what object the driver concerned about, which commonly provides a driving task-guided traffic scene understanding. Some recent works explore driver attention prediction in critical or accident scenarios and find a positive role in helping accident prediction, while the promotion ability is constrained by the prediction accuracy of driver attention maps. In this work, we explore the network connection gating mechanism for driver attention prediction (Gate-DAP). Gate-DAP aims to learn the importance of different spatial, temporal, and modality information in driving scenarios with various road types, occasions, and light and weather conditions. The network connection gating in Gate-DAP consists of a spatial encoding network gating, long-short-term memory network gating, and information type gating modules. Each connection gating operation is plug-and-play and can be flexibly assembled, which makes the architecture of Gate-DAP transparent for evaluating different spatial, temporal, and information types for driver attention prediction. Evaluations on DADA-2000 and BDDA datasets verify the superiority of the proposed method with the comparison with state-of-the-art approaches. The code is available on https://github.com/JWFangit/Gate-DAP.

CVMar 29, 2023
FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-tail Trajectory Prediction

Yuning Wang, Pu Zhang, Lei Bai et al.

Predicting the future trajectories of the traffic agents is a gordian technique in autonomous driving. However, trajectory prediction suffers from data imbalance in the prevalent datasets, and the tailed data is often more complicated and safety-critical. In this paper, we focus on dealing with the long-tail phenomenon in trajectory prediction. Previous methods dealing with long-tail data did not take into account the variety of motion patterns in the tailed data. In this paper, we put forward a future enhanced contrastive learning framework to recognize tail trajectory patterns and form a feature space with separate pattern clusters. Furthermore, a distribution aware hyper predictor is brought up to better utilize the shaped feature space. Our method is a model-agnostic framework and can be plugged into many well-known baselines. Experimental results show that our framework outperforms the state-of-the-art long-tail prediction method on tailed samples by 9.5% on ADE and 8.5% on FDE, while maintaining or slightly improving the averaged performance. Our method also surpasses many long-tail techniques on trajectory prediction task.

AIAug 30, 2023
Vision-Based Traffic Accident Detection and Anticipation: A Survey

Jianwu Fang, iahuan Qiao, Jianru Xue et al.

Traffic accident detection and anticipation is an obstinate road safety problem and painstaking efforts have been devoted. With the rapid growth of video data, Vision-based Traffic Accident Detection and Anticipation (named Vision-TAD and Vision-TAA) become the last one-mile problem for safe driving and surveillance safety. However, the long-tailed, unbalanced, highly dynamic, complex, and uncertain properties of traffic accidents form the Out-of-Distribution (OOD) feature for Vision-TAD and Vision-TAA. Current AI development may focus on these OOD but important problems. What has been done for Vision-TAD and Vision-TAA? What direction we should focus on in the future for this problem? A comprehensive survey is important. We present the first survey on Vision-TAD in the deep learning era and the first-ever survey for Vision-TAA. The pros and cons of each research prototype are discussed in detail during the investigation. In addition, we also provide a critical review of 31 publicly available benchmarks and related evaluation metrics. Through this survey, we want to spawn new insights and open possible trends for Vision-TAD and Vision-TAA tasks.

CVOct 10, 2022
Using Detection, Tracking and Prediction in Visual SLAM to Achieve Real-time Semantic Mapping of Dynamic Scenarios

Xingyu Chen, Jianru Xue, Jianwu Fang et al.

In this paper, we propose a lightweight system, RDS-SLAM, based on ORB-SLAM2, which can accurately estimate poses and build semantic maps at object level for dynamic scenarios in real time using only one commonly used Intel Core i7 CPU. In RDS-SLAM, three major improvements, as well as major architectural modifications, are proposed to overcome the limitations of ORB-SLAM2. Firstly, it adopts a lightweight object detection neural network in key frames. Secondly, an efficient tracking and prediction mechanism is embedded into the system to remove the feature points belonging to movable objects in all incoming frames. Thirdly, a semantic octree map is built by probabilistic fusion of detection and tracking results, which enables a robot to maintain a semantic description at object level for potential interactions in dynamic scenarios. We evaluate RDS-SLAM in TUM RGB-D dataset, and experimental results show that RDS-SLAM can run with 30.3 ms per frame in dynamic scenarios using only an Intel Core i7 CPU, and achieves comparable accuracy compared with the state-of-the-art SLAM systems which heavily rely on both Intel Core i7 CPUs and powerful GPUs.

CVNov 2, 2022
Heterogeneous Trajectory Forecasting via Risk and Scene Graph Learning

Jianwu Fang, Chen Zhu, Pu Zhang et al.

Heterogeneous trajectory forecasting is critical for intelligent transportation systems, but it is challenging because of the difficulty of modeling the complex interaction relations among the heterogeneous road agents as well as their agent-environment constraints. In this work, we propose a risk and scene graph learning method for trajectory forecasting of heterogeneous road agents, which consists of a Heterogeneous Risk Graph (HRG) and a Hierarchical Scene Graph (HSG) from the aspects of agent category and their movable semantic regions. HRG groups each kind of road agent and calculates their interaction adjacency matrix based on an effective collision risk metric. HSG of the driving scene is modeled by inferring the relationship between road agents and road semantic layout aligned by the road scene grammar. Based on this formulation, we can obtain effective trajectory forecasting in driving situations, and superior performance to other state-of-the-art approaches is demonstrated by exhaustive experiments on the nuScenes, ApolloScape, and Argoverse datasets.

CVNov 2, 2022
Deep Virtual-to-Real Distillation for Pedestrian Crossing Prediction

Jie Bai, Xin Fang, Jianwu Fang et al.

Pedestrian crossing is one of the most typical behavior which conflicts with natural driving behavior of vehicles. Consequently, pedestrian crossing prediction is one of the primary task that influences the vehicle planning for safe driving. However, current methods that rely on the practically collected data in real driving scenes cannot depict and cover all kinds of scene condition in real traffic world. To this end, we formulate a deep virtual to real distillation framework by introducing the synthetic data that can be generated conveniently, and borrow the abundant information of pedestrian movement in synthetic videos for the pedestrian crossing prediction in real data with a simple and lightweight implementation. In order to verify this framework, we construct a benchmark with 4667 virtual videos owning about 745k frames (called Virtual-PedCross-4667), and evaluate the proposed method on two challenging datasets collected in real driving situations, i.e., JAAD and PIE datasets. State-of-the-art performance of this framework is demonstrated by exhaustive experiment analysis. The dataset and code can be downloaded from the website \url{http://www.lotvs.net/code_data/}.

CVOct 10, 2022
Sparse Semantic Map-Based Monocular Localization in Traffic Scenes Using Learned 2D-3D Point-Line Correspondences

Xingyu Chen, Jianru Xue, Shanmin Pang

Vision-based localization in a prior map is of crucial importance for autonomous vehicles. Given a query image, the goal is to estimate the camera pose corresponding to the prior map, and the key is the registration problem of camera images within the map. While autonomous vehicles drive on the road under occlusion (e.g., car, bus, truck) and changing environment appearance (e.g., illumination changes, seasonal variation), existing approaches rely heavily on dense point descriptors at the feature level to solve the registration problem, entangling features with appearance and occlusion. As a result, they often fail to estimate the correct poses. To address these issues, we propose a sparse semantic map-based monocular localization method, which solves 2D-3D registration via a well-designed deep neural network. Given a sparse semantic map that consists of simplified elements (e.g., pole lines, traffic sign midpoints) with multiple semantic labels, the camera pose is then estimated by learning the corresponding features between the 2D semantic elements from the image and the 3D elements from the sparse semantic map. The proposed sparse semantic map-based localization approach is robust against occlusion and long-term appearance changes in the environments. Extensive experimental results show that the proposed method outperforms the state-of-the-art approaches.

LGFeb 2Code
Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models

Jinbin Bai, Yixuan Li, Yuchen Zhu et al.

Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding, which is ill-suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs' full generative potential remains an underexplored challenge. To address this, we propose Prism (Pruning, Remasking, and Integrated Self-verification Method), an efficient TTS framework for dLLMs that (i) performs Hierarchical Trajectory Search (HTS) which dynamically prunes and reallocates compute in an early-to-mid denoising window, (ii) introduces Local branching with partial remasking to explore diverse implementations while preserving high-confidence tokens, and (iii) replaces external verifiers with Self-Verified Feedback (SVF) obtained via self-evaluation prompts on intermediate completions. Across four mathematical reasoning and code generation benchmarks on three dLLMs, including LLaDA 8B Instruct, Dream 7B Instruct, and LLaDA 2.0-mini, our Prism achieves a favorable performance-efficiency trade-off, matching best-of-N performance with substantially fewer function evaluations (NFE). The code is released at https://github.com/viiika/Prism.

CVJun 25, 2023
Enhancing Mapless Trajectory Prediction through Knowledge Distillation

Yuning Wang, Pu Zhang, Lei Bai et al.

Scene information plays a crucial role in trajectory forecasting systems for autonomous driving by providing semantic clues and constraints on potential future paths of traffic agents. Prevalent trajectory prediction techniques often take high-definition maps (HD maps) as part of the inputs to provide scene knowledge. Although HD maps offer accurate road information, they may suffer from the high cost of annotation or restrictions of law that limits their widespread use. Therefore, those methods are still expected to generate reliable prediction results in mapless scenarios. In this paper, we tackle the problem of improving the consistency of multi-modal prediction trajectories and the real road topology when map information is unavailable during the test phase. Specifically, we achieve this by training a map-based prediction teacher network on the annotated samples and transferring the knowledge to a student mapless prediction network using a two-fold knowledge distillation framework. Our solution is generalizable for common trajectory prediction networks and does not bring extra computation burden. Experimental results show that our method stably improves prediction performance in mapless mode on many widely used state-of-the-art trajectory prediction baselines, compensating for the gaps caused by the absence of HD maps. Qualitative visualization results demonstrate that our approach helps infer unseen map information.

CVAug 24, 2024
Gating Syn-to-Real Knowledge for Pedestrian Crossing Prediction in Safe Driving

Jie Bai, Jianwu Fang, Yisheng Lv et al.

Pedestrian Crossing Prediction (PCP) in driving scenes plays a critical role in ensuring the safe operation of intelligent vehicles. Due to the limited observations of pedestrian crossing behaviors in typical situations, recent studies have begun to leverage synthetic data with flexible variation to boost prediction performance, employing domain adaptation frameworks. However, different domain knowledge has distinct cross-domain distribution gaps, which necessitates suitable domain knowledge adaption ways for PCP tasks. In this work, we propose a Gated Syn-to-Real Knowledge transfer approach for PCP (Gated-S2R-PCP), which has two aims: 1) designing the suitable domain adaptation ways for different kinds of crossing-domain knowledge, and 2) transferring suitable knowledge for specific situations with gated knowledge fusion. Specifically, we design a framework that contains three domain adaption methods including style transfer, distribution approximation, and knowledge distillation for various information, such as visual, semantic, depth, location, etc. A Learnable Gated Unit (LGU) is employed to fuse suitable cross-domain knowledge to boost pedestrian crossing prediction. We construct a new synthetic benchmark S2R-PCP-3181 with 3181 sequences (489,740 frames) which contains the pedestrian locations, RGB frames, semantic images, and depth images. With the synthetic S2R-PCP-3181, we transfer the knowledge to two real challenging datasets of PIE and JAAD, and superior PCP performance is obtained to the state-of-the-art methods.

ROApr 10, 2023
Learning Residual Model of Model Predictive Control via Random Forests for Autonomous Driving

Kang Zhao, Jianru Xue, Xiangning Meng et al.

One major issue in learning-based model predictive control (MPC) for autonomous driving is the contradiction between the system model's prediction accuracy and computation efficiency. The more situations a system model covers, the more complex it is, along with highly nonlinear and nonconvex properties. These issues make the optimization too complicated to solve and render real-time control impractical.To address these issues, we propose a hierarchical learning residual model which leverages random forests and linear regression.The learned model consists of two levels. The low level uses linear regression to fit the residues, and the high level uses random forests to switch different linear models. Meanwhile, we adopt the linear dynamic bicycle model with error states as the nominal model.The switched linear regression model is added to the nominal model to form the system model. It reformulates the learning-based MPC as a quadratic program (QP) problem and optimization solvers can effectively solve it. Experimental path tracking results show that the driving vehicle's prediction accuracy and tracking accuracy are significantly improved compared with the nominal MPC.Compared with the state-of-the-art Gaussian process-based nonlinear model predictive control (GP-NMPC), our method gets better performance on tracking accuracy while maintaining a lower computation consumption.

49.7CVMar 10
MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating

Yuning Wang, Pu Zhang, Yuan He et al.

Existing trajectory prediction methods exhibit significant performance degradation under distribution shifts during test time. Although test-time training techniques have been explored to enable adaptation, current approaches rely on an offline pre-trained predictor that lacks online learning flexibility. Moreover, they depend on fixed online model updating rules that do not accommodate the specific characteristics of test data. To address these limitations, we first propose a meta-learning framework to directly optimize the predictor for fast and accurate online adaptation, which performs bi-level optimization on the performance of simulated test-time adaptation tasks during pre-training. Furthermore, at test time, we introduce a data-adaptive model updating mechanism that dynamically adjusts the predefined learning rates and updating frequencies based on online partial derivatives and hard sample selection. This mechanism enables the online learning rate to suit the test data, and focuses on informative hard samples to enhance efficiency. Experiments are conducted on various challenging cross-dataset distribution shift scenarios, including nuScenes, Lyft, and Waymo. Results demonstrate that our method achieves superior adaptation accuracy, surpassing state-of-the-art test-time training methods for trajectory prediction. Additionally, our method excels under suboptimal learning rates and high FPS demands, showcasing its robustness and practicality.

CVDec 3, 2024Code
Redundant Queries in DETR-Based 3D Detection Methods: Unnecessary and Prunable

Lizhen Xu, Zehao Wu, Wenzhao Qiu et al.

Query-based models are extensively used in 3D object detection tasks, with a wide range of pre-trained checkpoints readily available online. However, despite their popularity, these models often require an excessive number of object queries, far surpassing the actual number of objects to detect. The redundant queries result in unnecessary computational and memory costs. In this paper, we find that not all queries contribute equally -- a significant portion of queries have a much smaller impact compared to others. Based on this observation, we propose an embarrassingly simple approach called Gradually Pruning Queries (GPQ), which prunes queries incrementally based on their classification scores. A key advantage of GPQ is that it requires no additional learnable parameters. It is straightforward to implement in any query-based method, as it can be seamlessly integrated as a fine-tuning step using an existing checkpoint after training. With GPQ, users can easily generate multiple models with fewer queries, starting from a checkpoint with an excessive number of queries. Experiments on various advanced 3D detectors show that GPQ effectively reduces redundant queries while maintaining performance. Using our method, model inference on desktop GPUs can be accelerated by up to 1.35x. Moreover, after deployment on edge devices, it achieves up to a 67.86% reduction in FLOPs and a 65.16% decrease in inference time. The code will be available at https://github.com/iseri27/Gpq.

CVNov 3, 2024Code
HeightMapNet: Explicit Height Modeling for End-to-End HD Map Learning

Wenzhao Qiu, Shanmin Pang, Hao zhang et al.

Recent advances in high-definition (HD) map construction from surround-view images have highlighted their cost-effectiveness in deployment. However, prevailing techniques often fall short in accurately extracting and utilizing road features, as well as in the implementation of view transformation. In response, we introduce HeightMapNet, a novel framework that establishes a dynamic relationship between image features and road surface height distributions. By integrating height priors, our approach refines the accuracy of Bird's-Eye-View (BEV) features beyond conventional methods. HeightMapNet also introduces a foreground-background separation network that sharply distinguishes between critical road elements and extraneous background components, enabling precise focus on detailed road micro-features. Additionally, our method leverages multi-scale features within the BEV space, optimally utilizing spatial geometric information to boost model performance. HeightMapNet has shown exceptional results on the challenging nuScenes and Argoverse 2 datasets, outperforming several widely recognized approaches. The code will be available at \url{https://github.com/adasfag/HeightMapNet/}.

CVMar 5, 2025Code
IC-Mapper: Instance-Centric Spatio-Temporal Modeling for Online Vectorized Map Construction

Jiangtong Zhu, Zhao Yang, Yinan Shi et al.

Online vector map construction based on visual data can bypass the processes of data collection, post-processing, and manual annotation required by traditional map construction, which significantly enhances map-building efficiency. However, existing work treats the online mapping task as a local range perception task, overlooking the spatial scalability required for map construction. We propose IC-Mapper, an instance-centric online mapping framework, which comprises two primary components: 1) Instance-centric temporal association module: For the detection queries of adjacent frames, we measure them in both feature and geometric dimensions to obtain the matching correspondence between instances across frames. 2) Instance-centric spatial fusion module: We perform point sampling on the historical global map from a spatial dimension and integrate it with the detection results of instances corresponding to the current frame to achieve real-time expansion and update of the map. Based on the nuScenes dataset, we evaluate our approach on detection, tracking, and global mapping metrics. Experimental results demonstrate the superiority of IC-Mapper against other state-of-the-art methods. Code will be released on https://github.com/Brickzhuantou/IC-Mapper.

AIFeb 13, 2024Code
Vehicle Behavior Prediction by Episodic-Memory Implanted NDT

Peining Shen, Jianwu Fang, Hongkai Yu et al.

In autonomous driving, predicting the behavior (turning left, stopping, etc.) of target vehicles is crucial for the self-driving vehicle to make safe decisions and avoid accidents. Existing deep learning-based methods have shown excellent and accurate performance, but the black-box nature makes it untrustworthy to apply them in practical use. In this work, we explore the interpretability of behavior prediction of target vehicles by an Episodic Memory implanted Neural Decision Tree (abbrev. eMem-NDT). The structure of eMem-NDT is constructed by hierarchically clustering the text embedding of vehicle behavior descriptions. eMem-NDT is a neural-backed part of a pre-trained deep learning model by changing the soft-max layer of the deep model to eMem-NDT, for grouping and aligning the memory prototypes of the historical vehicle behavior features in training data on a neural decision tree. Each leaf node of eMem-NDT is modeled by a neural network for aligning the behavior memory prototypes. By eMem-NDT, we infer each instance in behavior prediction of vehicles by bottom-up Memory Prototype Matching (MPM) (searching the appropriate leaf node and the links to the root node) and top-down Leaf Link Aggregation (LLA) (obtaining the probability of future behaviors of vehicles for certain instances). We validate eMem-NDT on BLVD and LOKI datasets, and the results show that our model can obtain a superior performance to other methods with clear explainability. The code is available at https://github.com/JWFangit/eMem-NDT.

CVDec 18, 2019Code
DADA: Driver Attention Prediction in Driving Accident Scenarios

Jianwu Fang, Dingxin Yan, Jiahuan Qiao et al.

Driver attention prediction is becoming an essential research problem in human-like driving systems. This work makes an attempt to predict the driver attention in driving accident scenarios (DADA). However, challenges tread on the heels of that because of the dynamic traffic scene, intricate and imbalanced accident categories. In this work, we design a semantic context induced attentive fusion network (SCAFNet). We first segment the RGB video frames into the images with different semantic regions (i.e., semantic images), where each region denotes one kind of semantic categories of the scene (e.g., road, trees, etc.), and learn the spatio-temporal features of RGB frames and semantic images in two parallel paths simultaneously. Then, the learned features are fused by an attentive fusion network to find the semantic-induced scene variation in driver attention prediction. The contributions are three folds. 1) With the semantic images, we introduce their semantic context features and verified the manifest promotion effect for helping the driver attention prediction, where the semantic context features are modeled by a graph convolution network (GCN) on semantic images; 2) We fuse the semantic context features of semantic images and the features of RGB frames in an attentive strategy, and the fused details are transferred over frames by a convolutional LSTM module to obtain the attention map of each video frame with the consideration of historical scene variation in driving situations; 3) The superiority of the proposed method is evaluated on our previously collected dataset (named as DADA-2000) and two other challenging datasets with state-of-the-art methods. DADA-2000 is available at https://github.com/JWFangit/LOTVS-DADA.

CVApr 2, 2019Code
Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition

Pengfei Zhang, Cuiling Lan, Wenjun Zeng et al.

Skeleton-based human action recognition has attracted great interest thanks to the easy accessibility of the human skeleton data. Recently, there is a trend of using very deep feedforward neural networks to model the 3D coordinates of joints without considering the computational efficiency. In this paper, we propose a simple yet effective semantics-guided neural network (SGN) for skeleton-based action recognition. We explicitly introduce the high level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability. In addition, we exploit the relationship of joints hierarchically through two modules, i.e., a joint-level module for modeling the correlations of joints in the same frame and a framelevel module for modeling the dependencies of frames by taking the joints in the same frame as a whole. A strong baseline is proposed to facilitate the study of this field. With an order of magnitude smaller model size than most previous works, SGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU datasets. The source code is available at https://github.com/microsoft/SGN.

CVMar 15, 2019Code
BLVD: Building A Large-scale 5D Semantics Benchmark for Autonomous Driving

Jianru Xue, Jianwu Fang, Tao Li et al.

In autonomous driving community, numerous benchmarks have been established to assist the tasks of 3D/2D object detection, stereo vision, semantic/instance segmentation. However, the more meaningful dynamic evolution of the surrounding objects of ego-vehicle is rarely exploited, and lacks a large-scale dataset platform. To address this, we introduce BLVD, a large-scale 5D semantics benchmark which does not concentrate on the static detection or semantic/instance segmentation tasks tackled adequately before. Instead, BLVD aims to provide a platform for the tasks of dynamic 4D (3D+temporal) tracking, 5D (4D+interactive) interactive event recognition and intention prediction. This benchmark will boost the deeper understanding of traffic scenes than ever before. We totally yield 249,129 3D annotations, 4,902 independent individuals for tracking with the length of overall 214,922 points, 6,004 valid fragments for 5D interactive event recognition, and 4,900 individuals for 5D intention prediction. These tasks are contained in four kinds of scenarios depending on the object density (low and high) and light conditions (daytime and nighttime). The benchmark can be downloaded from our project site https://github.com/VCCIV/BLVD/.

CVApr 20, 2018Code
View Adaptive Neural Networks for High Performance Skeleton-based Human Action Recognition

Pengfei Zhang, Cuiling Lan, Junliang Xing et al.

Skeleton-based human action recognition has recently attracted increasing attention thanks to the accessibility and the popularity of 3D skeleton data. One of the key challenges in skeleton-based action recognition lies in the large view variations when capturing data. In order to alleviate the effects of view variations, this paper introduces a novel view adaptation scheme, which automatically determines the virtual observation viewpoints in a learning based data driven manner. We design two view adaptive neural networks, i.e., VA-RNN based on RNN, and VA-CNN based on CNN. For each network, a novel view adaptation module learns and determines the most suitable observation viewpoints, and transforms the skeletons to those viewpoints for the end-to-end recognition with a main classification network. Ablation studies find that the proposed view adaptive models are capable of transforming the skeletons of various viewpoints to much more consistent virtual viewpoints which largely eliminates the viewpoint influence. In addition, we design a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks to provide the fused prediction. Extensive experimental evaluations on five challenging benchmarks demonstrate that the effectiveness of the proposed view-adaptive networks and superior performance over state-of-the-art approaches. The source code is available at https://github.com/microsoft/View-Adaptive-Neural-Networks-for-Skeleton-based-Human-Action-Recognition.

CVMar 1, 2024
Abductive Ego-View Accident Video Understanding for Safe Driving Perception

Jianwu Fang, Lei-lei Li, Junfei Zhou et al.

We present MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each with temporally aligned text descriptions. We annotate over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. MM-AU supports various accident understanding tasks, particularly multimodal video diffusion to understand accident cause-effect chains for safe driving. With MM-AU, we present an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video diffusion via an Object-Centric Video Diffusion (OAVD) method which is driven by an abductive CLIP model. This model involves a contrastive interaction loss to learn the pair co-occurrence of normal, near-accident, accident frames with the corresponding text descriptions, such as accident reasons, prevention advice, and accident categories. OAVD enforces the causal region learning while fixing the content of the original frame background in video generation, to find the dominant cause-effect chain for certain accidents. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD against the state-of-the-art diffusion models. Additionally, we provide careful benchmark evaluations for object detection and accident reason answering since AdVersa-SD relies on precise object and accident reason information.

CVJun 29, 2025
Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Lei-lei Li, Jianwu Fang, Junbin Xiao et al.

Egocentricly comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate the capability test to respond to unaffordable accidents in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.

MMMar 16, 2025
EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis

Jianwu Fang, Lei-Lei Li, Zhedong Zheng et al.

Traffic Accident Anticipation (TAA) in traffic scenes is a challenging problem for achieving zero fatalities in the future. Current approaches typically treat TAA as a supervised learning task needing the laborious annotation of accident occurrence duration. However, the inherent long-tailed, uncertain, and fast-evolving nature of traffic scenes has the problem that real causal parts of accidents are difficult to identify and are easily dominated by data bias, resulting in a background confounding issue. Thus, we propose an Attentive Video Diffusion (AVD) model that synthesizes additional accident video clips by generating the causal part in dashcam videos, i.e., from normal clips to accident clips. AVD aims to generate causal video frames based on accident or accident-free text prompts while preserving the style and content of frames for TAA after video generation. This approach can be trained using datasets collected from various driving scenes without any extra annotations. Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant triple loss for an anchor accident-free video clip, along with the generated pair of contrastive pseudo-normal and pseudo-accident clips. Extensive experiments have been conducted to evaluate the performance of AVD and EQ-TAA, and competitive performance compared to state-of-the-art methods has been obtained.

CVFeb 3, 2022
Trajectory Forecasting from Detection with Uncertainty-Aware Motion Encoding

Pu Zhang, Lei Bai, Jianru Xue et al.

Trajectory forecasting is critical for autonomous platforms to make safe planning and actions. Currently, most trajectory forecasting methods assume that object trajectories have been extracted and directly develop trajectory predictors based on the ground truth trajectories. However, this assumption does not hold in practical situations. Trajectories obtained from object detection and tracking are inevitably noisy, which could cause serious forecasting errors to predictors built on ground truth trajectories. In this paper, we propose a trajectory predictor directly based on detection results without relying on explicitly formed trajectories. Different from the traditional methods which encode the motion cue of an agent based on its clearly defined trajectory, we extract the motion information only based on the affinity cues among detection results, in which an affinity-aware state update mechanism is designed to take the uncertainty of association into account. In addition, considering that there could be multiple plausible matching candidates, we aggregate the states of them. This design relaxes the undesirable effect of noisy trajectory obtained from data association. Extensive ablation experiments validate the effectiveness of our method and its generalization ability on different detectors. Cross-comparison to other forecasting schemes further proves the superiority of our method. Code will be released upon acceptance.

CVNov 7, 2021
Multi-Scale Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition

Pengfei Zhang, Cuiling Lan, Wenjun Zeng et al.

Skeleton data is of low dimension. However, there is a trend of using very deep and complicated feedforward neural networks to model the skeleton sequence without considering the complexity in recent year. In this paper, a simple yet effective multi-scale semantics-guided neural network (MS-SGN) is proposed for skeleton-based action recognition. We explicitly introduce the high level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability of joints. Moreover, a multi-scale strategy is proposed to be robust to the temporal scale variations. In addition, we exploit the relationship of joints hierarchically through two modules, i.e., a joint-level module for modeling the correlations of joints in the same frame and a frame-level module for modeling the temporal dependencies of frames. With an order of magnitude smaller model size than most previous methods, MSSGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.

CVMar 25, 2020
DCDLearn: Multi-order Deep Cross-distance Learning for Vehicle Re-Identification

Rixing Zhu, Jianwu Fang, Hongke Xu et al.

Vehicle re-identification (Re-ID) has become a popular research topic owing to its practicability in intelligent transportation systems. Vehicle Re-ID suffers the numerous challenges caused by drastic variation in illumination, occlusions, background, resolutions, viewing angles, and so on. To address it, this paper formulates a multi-order deep cross-distance learning (\textbf{DCDLearn}) model for vehicle re-identification, where an efficient one-view CycleGAN model is developed to alleviate exhaustive and enumerative cross-camera matching problem in previous works and smooth the domain discrepancy of cross cameras. Specially, we treat the transferred images and the reconstructed images generated by one-view CycleGAN as multi-order augmented data for deep cross-distance learning, where the cross distances of multi-order image set with distinct identities are learned by optimizing an objective function with multi-order augmented triplet loss and center loss to achieve the camera-invariance and identity-consistency. Extensive experiments on three vehicle Re-ID datasets demonstrate that the proposed method achieves significant improvement over the state-of-the-arts, especially for the small scale dataset.

CVSep 3, 2019
EleAtt-RNN: Adding Attentiveness to Neurons in Recurrent Neural Networks

Pengfei Zhang, Jianru Xue, Cuiling Lan et al.

Recurrent neural networks (RNNs) are capable of modeling temporal dependencies of complex sequential data. In general, current available structures of RNNs tend to concentrate on controlling the contributions of current and previous information. However, the exploration of different importance levels of different elements within an input vector is always ignored. We propose a simple yet effective Element-wise-Attention Gate (EleAttG), which can be easily added to an RNN block (e.g. all RNN neurons in an RNN layer), to empower the RNN neurons to have attentiveness capability. For an RNN block, an EleAttG is used for adaptively modulating the input by assigning different levels of importance, i.e., attention, to each element/dimension of the input. We refer to an RNN block equipped with an EleAttG as an EleAtt-RNN block. Instead of modulating the input as a whole, the EleAttG modulates the input at fine granularity, i.e., element-wise, and the modulation is content adaptive. The proposed EleAttG, as an additional fundamental unit, is general and can be applied to any RNN structures, e.g., standard RNN, Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU). We demonstrate the effectiveness of the proposed EleAtt-RNN by applying it to different tasks including the action recognition, from both skeleton-based data and RGB videos, gesture recognition, and sequential MNIST classification. Experiments show that adding attentiveness through EleAttGs to RNN blocks significantly improves the power of RNNs.

CVApr 23, 2019
DADA-2000: Can Driving Accident be Predicted by Driver Attention? Analyzed by A Benchmark

Jianwu Fang, Dingxin Yan, Jiahuan Qiao et al.

Driver attention prediction is currently becoming the focus in safe driving research community, such as the DR(eye)VE project and newly emerged Berkeley DeepDrive Attention (BDD-A) database in critical situations. In safe driving, an essential task is to predict the incoming accidents as early as possible. BDD-A was aware of this problem and collected the driver attention in laboratory because of the rarity of such scenes. Nevertheless, BDD-A focuses the critical situations which do not encounter actual accidents, and just faces the driver attention prediction task, without a close step for accident prediction. In contrast to this, we explore the view of drivers' eyes for capturing multiple kinds of accidents, and construct a more diverse and larger video benchmark than ever before with the driver attention and the driving accident annotation simultaneously (named as DADA-2000), which has 2000 video clips owning about 658,476 frames on 54 kinds of accidents. These clips are crowd-sourced and captured in various occasions (highway, urban, rural, and tunnel), weather (sunny, rainy and snowy) and light conditions (daytime and nighttime). For the driver attention representation, we collect the maps of fixations, saccade scan path and focusing time. The accidents are annotated by their categories, the accident window in clips and spatial locations of the crash-objects. Based on the analysis, we obtain a quantitative and positive answer for the question in this paper.

CVMar 7, 2019
SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction

Pu Zhang, Wanli Ouyang, Pengfei Zhang et al.

In crowd scenarios, reliable trajectory prediction of pedestrians requires insightful understanding of their social behaviors. These behaviors have been well investigated by plenty of studies, while it is hard to be fully expressed by hand-craft rules. Recent studies based on LSTM networks have shown great ability to learn social behaviors. However, many of these methods rely on previous neighboring hidden states but ignore the important current intention of the neighbors. In order to address this issue, we propose a data-driven state refinement module for LSTM network (SR-LSTM), which activates the utilization of the current intention of neighbors, and jointly and iteratively refines the current states of all participants in the crowd through a message passing mechanism. To effectively extract the social effect of neighbors, we further introduce a social-aware information selection mechanism consisting of an element-wise motion gate and a pedestrian-wise attention to select useful message from neighboring pedestrians. Experimental results on two public datasets, i.e. ETH and UCY, demonstrate the effectiveness of our proposed SR-LSTM and we achieves state-of-the-art results.

CVJul 12, 2018
Adding Attentiveness to the Neurons in Recurrent Neural Networks

Pengfei Zhang, Jianru Xue, Cuiling Lan et al.

Recurrent neural networks (RNNs) are capable of modeling the temporal dynamics of complex sequential information. However, the structures of existing RNN neurons mainly focus on controlling the contributions of current and historical information but do not explore the different importance levels of different elements in an input vector of a time slot. We propose adding a simple yet effective Element-wiseAttention Gate (EleAttG) to an RNN block (e.g., all RNN neurons in a network layer) that empowers the RNN neurons to have the attentiveness capability. For an RNN block, an EleAttG is added to adaptively modulate the input by assigning different levels of importance, i.e., attention, to each element/dimension of the input. We refer to an RNN block equipped with an EleAttG as an EleAtt-RNN block. Specifically, the modulation of the input is content adaptive and is performed at fine granularity, being element-wise rather than input-wise. The proposed EleAttG, as an additional fundamental unit, is general and can be applied to any RNN structures, e.g., standard RNN, Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU). We demonstrate the effectiveness of the proposed EleAtt-RNN by applying it to the action recognition tasks on both 3D human skeleton data and RGB videos. Experiments show that adding attentiveness through EleAttGs to RNN blocks significantly boosts the power of RNNs.

IRMay 22, 2018
Deep Feature Aggregation and Image Re-ranking with Heat Diffusion for Image Retrieval

Shanmin Pang, Jin Ma, Jianru Xue et al.

Image retrieval based on deep convolutional features has demonstrated state-of-the-art performance in popular benchmarks. In this paper, we present a unified solution to address deep convolutional feature aggregation and image re-ranking by simulating the dynamics of heat diffusion. A distinctive problem in image retrieval is that repetitive or \emph{bursty} features tend to dominate final image representations, resulting in representations less distinguishable. We show that by considering each deep feature as a heat source, our unsupervised aggregation method is able to avoid over-representation of \emph{bursty} features. We additionally provide a practical solution for the proposed aggregation method and further show the efficiency of our method in experimental evaluation. Inspired by the aforementioned deep feature aggregation method, we also propose a method to re-rank a number of top ranked images for a given query image by considering the query as the heat source. Finally, we extensively evaluate the proposed approach with pre-trained and fine-tuned deep networks on common public benchmarks and show superior performance compared to previous work.

CVMar 24, 2017
View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data

Pengfei Zhang, Cuiling Lan, Junliang Xing et al.

Skeleton-based human action recognition has recently attracted increasing attention due to the popularity of 3D skeleton data. One main challenge lies in the large view variations in captured human actions. We propose a novel view adaptation scheme to automatically regulate observation viewpoints during the occurrence of an action. Rather than re-positioning the skeletons based on a human defined prior criterion, we design a view adaptive recurrent neural network (RNN) with LSTM architecture, which enables the network itself to adapt to the most suitable observation viewpoints from end to end. Extensive experiment analyses show that the proposed view adaptive RNN model strives to (1) transform the skeletons of various views to much more consistent viewpoints and (2) maintain the continuity of the action rather than transforming every frame to the same position with the same body orientation. Our model achieves significant improvement over the state-of-the-art approaches on three benchmark datasets.

CVApr 5, 2016
Counting Grid Aggregation for Event Retrieval and Recognition

Zhanning Gao, Gang Hua, Dongqing Zhang et al.

Event retrieval and recognition in a large corpus of videos necessitates a holistic fixed-size visual representation at the video clip level that is comprehensive, compact, and yet discriminative. It shall comprehensively aggregate information across relevant video frames, while suppress redundant information, leading to a compact representation that can effectively differentiate among different visual events. In search for such a representation, we propose to build a spatially consistent counting grid model to aggregate together deep features extracted from different video frames. The spatial consistency of the counting grid model is achieved by introducing a prior model estimated from a large corpus of video data. The counting grid model produces an intermediate tensor representation for each video, which automatically identifies and removes the feature redundancy across the different frames. The tensor representation is subsequently reduced to a fixed-size vector representation by averaging over the counting grid. When compared to existing methods on both event retrieval and event classification benchmarks, we achieve significantly better accuracy with much more compact representation.