57.8CVMay 30
MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory ForecastingYu Liu, Ming Huang, Xiao Ren et al.
Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for Pedestrian Trajectory Forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-Net achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.
CVMar 7, 2023
Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event StreamsBochen Xie, Yongjian Deng, Zhanpeng Shao et al.
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components, including the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. Enabling the network to incorporate a long-range temporal structure, we introduce a segment modeling strategy (S$^{2}$TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks, including object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
CVSep 15, 2022
A Temporal Densely Connected Recurrent Network for Event-based Human Pose EstimationZhanpeng Shao, Wen Zhou, Wuzhen Wang et al.
Event camera is an emerging bio-inspired vision sensors that report per-pixel brightness changes asynchronously. It holds noticeable advantage of high dynamic range, high speed response, and low power budget that enable it to best capture local motions in uncontrolled environments. This motivates us to unlock the potential of event cameras for human pose estimation, as the human pose estimation with event cameras is rarely explored. Due to the novel paradigm shift from conventional frame-based cameras, however, event signals in a time interval contain very limited information, as event cameras can only capture the moving body parts and ignores those static body parts, resulting in some parts to be incomplete or even disappeared in the time interval. This paper proposes a novel densely connected recurrent architecture to address the problem of incomplete information. By this recurrent architecture, we can explicitly model not only the sequential but also non-sequential geometric consistency across time steps to accumulate information from previous frames to recover the entire human bodies, achieving a stable and accurate human pose estimation from event data. Moreover, to better evaluate our model, we collect a large scale multimodal event-based dataset that comes with human pose annotations, which is by far the most challenging one to the best of our knowledge. The experimental results on two public datasets and our own dataset demonstrate the effectiveness and strength of our approach. Code can be available online for facilitating the future research.
CVFeb 8, 2023
A Dynamic Graph CNN with Cross-Representation Distillation for Event-Based RecognitionYongjian Deng, Hao Chen, Bochen Xie et al.
Recent advances in event-based research prioritize sparsity and temporal precision. Approaches using dense frame-based representations processed via well-pretrained CNNs are being replaced by the use of sparse point-based representations learned through graph CNNs (GCN). Yet, the efficacy of these graph methods is far behind their frame-based counterparts with two limitations. ($i$) Biased graph construction without carefully integrating variant attributes ($i.e.$, semantics, spatial and temporal cues) for each vertex, leading to imprecise graph representation. ($ii$) Deficient learning because of the lack of well-pretrained models available. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show our model and learning framework are effective and generalize well across multiple vision tasks.
CVMar 19, 2023
ANMS: Asynchronous Non-Maximum Suppression in Event StreamQianang Zhou, JunLin Xiong, Youfu Li
The non-maximum suppression (NMS) is widely used in frame-based tasks as an essential post-processing algorithm. However, event-based NMS either has high computational complexity or leads to frequent discontinuities. As a result, the performance of event-based corner detectors is limited. This paper proposes a general-purpose asynchronous non-maximum suppression pipeline (ANMS), and applies it to corner event detection. The proposed pipeline extract fine feature stream from the output of original detectors and adapts to the speed of motion. The ANMS runs directly on the asynchronous event stream with extremely low latency, which hardly affects the speed of original detectors. Additionally, we evaluate the DAVIS-based ground-truth labeling method to fill the gap between frame and event. Evaluation on public dataset indicates that the proposed ANMS pipeline significantly improves the performance of three classical asynchronous detectors with negligible latency. More importantly, the proposed ANMS framework is a natural extension of NMS, which is applicable to other asynchronous scoring tasks for event cameras.
CVJan 1, 2025
Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow EstimationQianang Zhou, Junhui Hou, Meiyi Yang et al.
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. These complementary characteristics underscore the potential of integrating frame and event data for optical flow estimation. However, most cross-modal approaches fail to fully utilize the complementary advantages, relying instead on simply stacking information. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality, achieving effective cross-modal fusion. Specifically, we propose an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events. We use the enhanced representation as the guiding modality and employ events to capture temporally dense motion information. The robust motion features derived from the guiding modality direct the aggregation of motion information from events. To further enhance fusion, we propose a transformer-based module that complements sparse event motion features with spatially rich frame information and enhances global information propagation. Additionally, a mix-fusion encoder is designed to extract comprehensive spatiotemporal contextual features from both modalities. Extensive experiments on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our framework. Leveraging the complementary strengths of frames and events, our method achieves leading performance on the DSEC-Flow dataset. Compared to the event-only model, frame guidance improves accuracy by 10\%. Furthermore, it outperforms the state-of-the-art fusion-based method with a 4\% accuracy gain and a 45\% reduction in inference time.
CVDec 12, 2024
ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion EstimationQianang Zhou, Zhiyu Zhu, Junhui Hou et al.
Event cameras hold significant promise for high-temporal-resolution (HTR) motion estimation. However, estimating event-based HTR optical flow faces two key challenges: the absence of HTR ground-truth data and the intrinsic sparsity of event data. Most existing approaches rely on the flow accumulation paradigms to indirectly supervise intermediate flows, often resulting in accumulation errors and optimization difficulties. To address these challenges, we propose a residual-based paradigm for estimating HTR optical flow with event data. Our approach separates HTR flow estimation into two stages: global linear motion estimation and HTR residual flow refinement. The residual paradigm effectively mitigates the impacts of event sparsity on optimization and is compatible with any LTR algorithm. Next, to address the challenge posed by the absence of HTR ground truth, we incorporate novel learning strategies. Specifically, we initially employ a shared refiner to estimate the residual flows, enabling both LTR supervision and HTR inference. Subsequently, we introduce regional noise to simulate the residual patterns of intermediate flows, facilitating the adaptation from LTR supervision to HTR inference. Additionally, we show that the noise-based strategy supports in-domain self-supervised training. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art accuracy in both LTR and HTR metrics, highlighting its effectiveness and superiority.
CVApr 28, 2024
Event-based Video Frame Interpolation with Edge Guided Motion RefinementYuhan Liu, Yongjian Deng, Hao Chen et al.
Video frame interpolation, the process of synthesizing intermediate frames between sequential video frames, has made remarkable progress with the use of event cameras. These sensors, with microsecond-level temporal resolution, fill information gaps between frames by providing precise motion cues. However, contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often neglect the fact that event data primarily supply high-confidence features at scene edges during multi-modal feature fusion, thereby diminishing the role of event signals in optical flow (OF) estimation and warping refinement. To address this overlooked aspect, we introduce an end-to-end E-VFI learning method (referred to as EGMR) to efficiently utilize edge features from event signals for motion flow and warping enhancement. Our method incorporates an Edge Guided Attentive (EGA) module, which rectifies estimated video motion through attentive aggregation based on the local correlation of multi-modal features in a coarse-to-fine strategy. Moreover, given that event data can provide accurate visual references at scene edges between consecutive frames, we introduce a learned visibility map derived from event data to adaptively mitigate the occlusion problem in the warping refinement process. Extensive experiments on both synthetic and real datasets show the effectiveness of the proposed approach, demonstrating its potential for higher quality video frame interpolation.
CVJun 1, 2021
A Voxel Graph CNN for Object Classification with Event CamerasYongjian Deng, Hao Chen, Hai Liu et al.
Event cameras attract researchers' attention due to their low power consumption, high dynamic range, and extremely high temporal resolution. Learning models on event-based object classification have recently achieved massive success by accumulating sparse events into dense frames to apply traditional 2D learning methods. Yet, these approaches necessitate heavy-weight models and are with high computational complexity due to the redundant information introduced by the sparse-to-dense conversion, limiting the potential of event cameras on real-life applications. This study aims to address the core problem of balancing accuracy and model complexity for event-based classification models. To this end, we introduce a novel graph representation for event data to exploit their sparsity better and customize a lightweight voxel graph convolutional neural network (\textit{EV-VGCNN}) for event-based classification. Specifically, (1) using voxel-wise vertices rather than previous point-wise inputs to explicitly exploit regional 2D semantics of event streams while keeping the sparsity;(2) proposing a multi-scale feature relational layer (\textit{MFRL}) to extract spatial and motion cues from each vertex discriminatively concerning its distances to neighbors. Comprehensive experiments show that our model can advance state-of-the-art classification accuracy with extremely low model complexity (merely 0.84M parameters).
CVSep 20, 2019
CNN-based RGB-D Salient Object Detection: Learn, Select and FuseHao Chen, Youfu Li
The goal of this work is to present a systematic solution for RGB-D salient object detection, which addresses the following three aspects with a unified framework: modal-specific representation learning, complementary cue selection and cross-modal complement fusion. To learn discriminative modal-specific features, we propose a hierarchical cross-modal distillation scheme, in which the well-learned source modality provides supervisory signals to facilitate the learning process for the new modality. To better extract the complementary cues, we formulate a residual function to incorporate complements from the paired modality adaptively. Furthermore, a top-down fusion structure is constructed for sufficient cross-modal interactions and cross-level transmissions. The experimental results demonstrate the effectiveness of the proposed cross-modal distillation scheme in zero-shot saliency detection and pre-training on a new modality, as well as the advantages in selecting and fusing cross-modal/cross-level complements.
LOSep 18, 2019
BigData Applications from Graph Analytics to Machine Learning by Aggregates in RecursionAriyam Das, Youfu Li, Jin Wang et al.
In the past, the semantic issues raised by the non-monotonic nature of aggregates often prevented their use in the recursive statements of logic programs and deductive databases. However, the recently introduced notion of Pre-mappability (PreM) has shown that, in key applications of interest, aggregates can be used in recursion to optimize the perfect-model semantics of aggregate-stratified programs. Therefore we can preserve the declarative formal semantics of such programs while achieving a highly efficient operational semantics that is conducive to scalable implementations on parallel and distributed platforms. In this paper, we show that with PreM, a wide spectrum of classical algorithms of practical interest, ranging from graph analytics and dynamic programming based optimization problems to data mining and machine learning applications can be concisely expressed in declarative languages by using aggregates in recursion. Our examples are also used to show that PreM can be checked using simple techniques and templatized verification strategies. A wide range of advanced BigData applications can now be expressed declaratively in logic-based languages, including Datalog, Prolog, and even SQL, while enabling their execution with superior performance and scalability.
ROJun 18, 2016
RRV: A Spatiotemporal Descriptor for Rigid Body Motion RecognitionYao Guo, Youfu Li, Zhanpeng Shao
Motion behaviors of a rigid body can be characterized by a 6-dimensional motion trajectory, which contains position vectors of a reference point on the rigid body and rotations of this rigid body over time. This paper devises a Rotation and Relative Velocity (RRV) descriptor by exploring the local translational and rotational invariants of motion trajectories of rigid bodies, which is insensitive to noise, invariant to rigid transformation and scaling. A flexible metric is also introduced to measure the distance between two RRV descriptors. The RRV descriptor is then applied to characterize motions of a human body skeleton modeled as articulated interconnections of multiple rigid bodies. To illustrate the descriptive ability of the RRV descriptor, we explore it for different rigid body motion recognition tasks. The experimental results on benchmark datasets demonstrate that this simple RRV descriptor outperforms the previous ones regarding recognition accuracy without increasing computational cost.