Yongjian Deng

CV
h-index14
12papers
142citations
Novelty52%
AI Score49

12 Papers

CVMar 7, 2023
Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams

Bochen Xie, Yongjian Deng, Zhanpeng Shao et al.

Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components, including the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. Enabling the network to incorporate a long-range temporal structure, we introduce a segment modeling strategy (S$^{2}$TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks, including object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.

CVAug 19, 2023
Dissecting RGB-D Learning for Improved Multi-modal Fusion

Hao Chen, Haoran Zhou, Yunshu Zhang et al.

In the RGB-D vision community, extensive research has been focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a black box. In this paper, we present an analytical framework and a novel score to dissect the RGB-D vision community. Our approach involves measuring proposed semantic variance and feature similarity across modalities and levels, conducting visual and quantitative analyzes on multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, evolution rules within each modality, and the collaboration logic used when optimizing a RGB-D model. Our studies reveal/verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.

CVFeb 8, 2023
A Dynamic Graph CNN with Cross-Representation Distillation for Event-Based Recognition

Yongjian Deng, Hao Chen, Bochen Xie et al.

Recent advances in event-based research prioritize sparsity and temporal precision. Approaches using dense frame-based representations processed via well-pretrained CNNs are being replaced by the use of sparse point-based representations learned through graph CNNs (GCN). Yet, the efficacy of these graph methods is far behind their frame-based counterparts with two limitations. ($i$) Biased graph construction without carefully integrating variant attributes ($i.e.$, semantics, spatial and temporal cues) for each vertex, leading to imprecise graph representation. ($ii$) Deficient learning because of the lack of well-pretrained models available. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show our model and learning framework are effective and generalize well across multiple vision tasks.

90.5LGMay 22
Balancing Multimodal Learning through Label Space Reshaping

Xiaoyu Ma, Weijie Zhang, Yuanhao Gao et al.

Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain undertrained. Existing approaches typically mitigate this issue by strengthening the weak modality or adjusting optimization gradients. However, such strategies mainly compensate for optimization rate discrepancies, often at the expense of the strong modality's optimization capacity, without analyzing how these discrepancies arise at the modality level. Based on theoretical insights and empirical observations, we argue that the discrepancy of learning pace arises from differences in the mapping difficulty between modality-specific feature space and the shared label space. To address this issue, we propose Balanced Multimodal Label Reshaping (BMLR), the first method that promotes multimodal balance from the label-side design. BMLR reshapes the cross-modal label space to equalize mapping difficulty across modalities, thereby facilitating modality interaction and injecting richer inter-class information into each modality. Extensive experiments across multiple architectures demonstrate that BMLR consistently improves multimodal performance and exhibits strong compatibility with diverse model designs. The source code will be released soon.

CVSep 18, 2024
EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning

Yukun Tian, Hao Chen, Yongjian Deng et al.

The event camera has demonstrated significant success across a wide range of areas due to its low time latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. Our EventAug can facilitate models learning with richer motion patterns, object variants and local spatio-temporal relations, thus improving model robustness to varied moving speeds, occlusions, and action disruptions. Experiment results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be publicly available for this community.

LGJun 13, 2025Code
Improving Multimodal Learning Balance and Sufficiency through Data Remixing

Xiaoyu Ma, Hao Chen, Yongjian Deng

Different modalities hold considerable gaps in optimization trajectories, including speeds and paths, which lead to modality laziness and modality clash when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, including decoupling multimodal data and filtering hard samples for each modality to mitigate modality imbalance; and then batch-level reassembling to align the gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50%$\uparrow$ on CREMAD and 3.41%$\uparrow$ on Kinetic-Sounds, without training set expansion or additional computational overhead during inference. The source code is available at https://github.com/MatthewMaxy/Remix_ICML2025.

CVOct 30, 2024Code
Prune and Repaint: Content-Aware Image Retargeting for any Ratio

Feihong Shen, Chao Li, Yifeng Geng et al.

Image retargeting is the task of adjusting the aspect ratio of images to suit different display devices or presentation environments. However, existing retargeting methods often struggle to balance the preservation of key semantics and image quality, resulting in either deformation or loss of important objects, or the introduction of local artifacts such as discontinuous pixels and inconsistent regenerated content. To address these issues, we propose a content-aware retargeting method called PruneRepaint. It incorporates semantic importance for each pixel to guide the identification of regions that need to be pruned or preserved in order to maintain key semantics. Additionally, we introduce an adaptive repainting module that selects image regions for repainting based on the distribution of pruned pixels and the proportion between foreground size and target aspect ratio, thus achieving local smoothness after pruning. By focusing on the content and structure of the foreground, our PruneRepaint approach adaptively avoids key content loss and deformation, while effectively mitigating artifacts with local repainting. We conduct experiments on the public RetargetMe benchmark and demonstrate through objective experimental results and subjective user studies that our method outperforms previous approaches in terms of preserving semantics and aesthetics, as well as better generalization across diverse aspect ratios. Codes will be available at https://github.com/fhshen2022/PruneRepaint.

CVMar 1, 2024
Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

Zhenpeng Huang, Chao Li, Hao Chen et al.

In this paper, we present a new data-efficient voxel-based self-supervised learning method for event cameras. Our pre-training overcomes the limitations of previous methods, which either sacrifice temporal information by converting event sequences into 2D images for utilizing pre-trained image models or directly employ paired image data for knowledge distillation to enhance the learning of event streams. In order to make our pre-training data-efficient, we first design a semantic-uniform masking method to address the learning imbalance caused by the varying reconstruction difficulties of different regions in non-uniform data when using random masking. In addition, we ease the traditional hybrid masked modeling process by explicitly decomposing it into two branches, namely local spatio-temporal reconstruction and global semantic reconstruction to encourage the encoder to capture local correlations and global semantics, respectively. This decomposition allows our selfsupervised learning method to converge faster with minimal pre-training data. Compared to previous approaches, our self-supervised learning method does not rely on paired RGB images, yet enables simultaneous exploration of spatial and temporal cues in multiple scales. It exhibits excellent generalization performance and demonstrates significant improvements across various tasks with fewer parameters and lower computational costs.

CVJan 1, 2025
Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation

Qianang Zhou, Junhui Hou, Meiyi Yang et al.

Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. These complementary characteristics underscore the potential of integrating frame and event data for optical flow estimation. However, most cross-modal approaches fail to fully utilize the complementary advantages, relying instead on simply stacking information. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality, achieving effective cross-modal fusion. Specifically, we propose an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events. We use the enhanced representation as the guiding modality and employ events to capture temporally dense motion information. The robust motion features derived from the guiding modality direct the aggregation of motion information from events. To further enhance fusion, we propose a transformer-based module that complements sparse event motion features with spatially rich frame information and enhances global information propagation. Additionally, a mix-fusion encoder is designed to extract comprehensive spatiotemporal contextual features from both modalities. Extensive experiments on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our framework. Leveraging the complementary strengths of frames and events, our method achieves leading performance on the DSEC-Flow dataset. Compared to the event-only model, frame guidance improves accuracy by 10\%. Furthermore, it outperforms the state-of-the-art fusion-based method with a 4\% accuracy gain and a 45\% reduction in inference time.

CVDec 12, 2024
ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion Estimation

Qianang Zhou, Zhiyu Zhu, Junhui Hou et al.

Event cameras hold significant promise for high-temporal-resolution (HTR) motion estimation. However, estimating event-based HTR optical flow faces two key challenges: the absence of HTR ground-truth data and the intrinsic sparsity of event data. Most existing approaches rely on the flow accumulation paradigms to indirectly supervise intermediate flows, often resulting in accumulation errors and optimization difficulties. To address these challenges, we propose a residual-based paradigm for estimating HTR optical flow with event data. Our approach separates HTR flow estimation into two stages: global linear motion estimation and HTR residual flow refinement. The residual paradigm effectively mitigates the impacts of event sparsity on optimization and is compatible with any LTR algorithm. Next, to address the challenge posed by the absence of HTR ground truth, we incorporate novel learning strategies. Specifically, we initially employ a shared refiner to estimate the residual flows, enabling both LTR supervision and HTR inference. Subsequently, we introduce regional noise to simulate the residual patterns of intermediate flows, facilitating the adaptation from LTR supervision to HTR inference. Additionally, we show that the noise-based strategy supports in-domain self-supervised training. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art accuracy in both LTR and HTR metrics, highlighting its effectiveness and superiority.

CVApr 28, 2024
Event-based Video Frame Interpolation with Edge Guided Motion Refinement

Yuhan Liu, Yongjian Deng, Hao Chen et al.

Video frame interpolation, the process of synthesizing intermediate frames between sequential video frames, has made remarkable progress with the use of event cameras. These sensors, with microsecond-level temporal resolution, fill information gaps between frames by providing precise motion cues. However, contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often neglect the fact that event data primarily supply high-confidence features at scene edges during multi-modal feature fusion, thereby diminishing the role of event signals in optical flow (OF) estimation and warping refinement. To address this overlooked aspect, we introduce an end-to-end E-VFI learning method (referred to as EGMR) to efficiently utilize edge features from event signals for motion flow and warping enhancement. Our method incorporates an Edge Guided Attentive (EGA) module, which rectifies estimated video motion through attentive aggregation based on the local correlation of multi-modal features in a coarse-to-fine strategy. Moreover, given that event data can provide accurate visual references at scene edges between consecutive frames, we introduce a learned visibility map derived from event data to adaptively mitigate the occlusion problem in the warping refinement process. Extensive experiments on both synthetic and real datasets show the effectiveness of the proposed approach, demonstrating its potential for higher quality video frame interpolation.

CVJun 1, 2021
A Voxel Graph CNN for Object Classification with Event Cameras

Yongjian Deng, Hao Chen, Hai Liu et al.

Event cameras attract researchers' attention due to their low power consumption, high dynamic range, and extremely high temporal resolution. Learning models on event-based object classification have recently achieved massive success by accumulating sparse events into dense frames to apply traditional 2D learning methods. Yet, these approaches necessitate heavy-weight models and are with high computational complexity due to the redundant information introduced by the sparse-to-dense conversion, limiting the potential of event cameras on real-life applications. This study aims to address the core problem of balancing accuracy and model complexity for event-based classification models. To this end, we introduce a novel graph representation for event data to exploit their sparsity better and customize a lightweight voxel graph convolutional neural network (\textit{EV-VGCNN}) for event-based classification. Specifically, (1) using voxel-wise vertices rather than previous point-wise inputs to explicitly exploit regional 2D semantics of event streams while keeping the sparsity;(2) proposing a multi-scale feature relational layer (\textit{MFRL}) to extract spatial and motion cues from each vertex discriminatively concerning its distances to neighbors. Comprehensive experiments show that our model can advance state-of-the-art classification accuracy with extremely low model complexity (merely 0.84M parameters).