93.7 NE Apr 12 Code
SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba
Yulong Huang, Jianxiong Tang, Chao Wang et al.
Large Language Models (LLMs) have achieved remarkable performance across tasks but remain energy-intensive due to dense matrix operations. Spiking neural networks (SNNs) improve energy efficiency by replacing dense matrix multiplications with sparse accumulations. Their sparse spike activity enables efficient LLM deployment on edge devices. However, prior SNN-based LLMs often sacrifice performance for efficiency, and recovering accuracy typically requires full pretraining, which is costly and impractical. To address this, we propose SpikingMamba, an energy-efficient SNN-based LLM distilled from Mamba that improves energy efficiency with minimal accuracy sacrifice. SpikingMamba integrates two key components: (a) SI-LIF, a signed-integer spiking neuron that preserves semantic polarity through signed multi-level spike representations, and (b) a training-exclusive Smoothed Gradient Compensation (SGC) path that mitigates quantization loss while preserving spike-driven efficiency. We employ a single-stage distillation strategy to transfer the zero-shot ability of pretrained Mamba and further enhance it via reinforcement learning (RL). Experiments show that SpikingMamba-1.3B achieves a 4.76× energy benefit, with only a 4.78% zero-shot accuracy gap compared to the original Mamba. The model achieves a further 2.55% accuracy improvement after RL, narrowing the performance gap from 4.78% to 2.23%. Code is available at: https://github.com/HuuYuLong/SpikingMamba.
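To make the idea of signed multi-level spikes concrete, here is a minimal sketch of a generic integrate-and-fire neuron that emits signed integer spike counts. It is not the paper's SI-LIF definition: the function name, the truncation-based quantization, the soft reset, and the parameters `threshold`, `max_level`, and `leak` are all assumptions chosen for illustration.

```python
def signed_multilevel_if(inputs, threshold=1.0, max_level=2, leak=1.0):
    """Generic signed multi-level integrate-and-fire neuron (illustrative only).

    Each step emits an integer in [-max_level, max_level] so that the sign of
    the accumulated input is preserved in the spike itself.
    """
    potential = 0.0
    spikes = []
    for x in inputs:
        # Leaky integration of the incoming current (leak=1.0 means no leak).
        potential = leak * potential + x
        # Number of whole thresholds crossed; int() truncates toward zero.
        level = int(potential / threshold)
        # Clip to the allowed signed integer range.
        level = max(-max_level, min(max_level, level))
        # Soft reset: subtract only the charge that was emitted as spikes.
        potential -= level * threshold
        spikes.append(level)
    return spikes


if __name__ == "__main__":
    print(signed_multilevel_if([0.6, 0.6, -2.5]))
```

With the inputs shown, the neuron stays silent on the first step, emits one positive spike on the second, and emits a spike of level -2 on the third, so a downstream layer only needs signed integer accumulations instead of dense multiplications.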
88.6 LG May 7 Code
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
Yulong Huang, Xiang Liu, Hongxiang Huang et al.
Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence in optimization. While momentum-based optimizers provide a natural remedy, they pose challenges in simultaneously achieving training efficiency and effectiveness. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve training throughput comparable to that of competitive linear models such as Mamba2 and KDA. Extensive experiments on the 400M and 1.3B parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2 and GDN, across diverse downstream evaluation benchmarks. Code: https://github.com/HuuYuLong/MomentumDeltaNet.
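The view of a linear recurrence as online SGD can be sketched with a plain sequential reference. The code below applies one step of gradient descent on the loss 0.5 * ||S k - v||^2 to a memory matrix S, with an optional heavy-ball momentum buffer. It is only an illustration of a stepwise momentum recurrence, not the chunkwise parallel MDN algorithm or its gating; the function name and the scalar parameters `beta` (step size) and `mu` (momentum) are assumptions.

```python
def momentum_delta_step(state, momentum, key, value, beta=1.0, mu=0.0):
    """One sequential delta-rule update with heavy-ball momentum (illustrative).

    state, momentum: d_v x d_k matrices as nested lists.
    key: length d_k, value: length d_v.
    """
    d_v, d_k = len(state), len(key)
    # Prediction error of the current memory: S k - v.
    error = [
        sum(state[i][j] * key[j] for j in range(d_k)) - value[i]
        for i in range(d_v)
    ]
    # Gradient of 0.5 * ||S k - v||^2 w.r.t. S is the outer product error k^T;
    # the momentum buffer accumulates a decayed sum of these gradients.
    new_momentum = [
        [mu * momentum[i][j] + error[i] * key[j] for j in range(d_k)]
        for i in range(d_v)
    ]
    # Descent step along the momentum direction.
    new_state = [
        [state[i][j] - beta * new_momentum[i][j] for j in range(d_k)]
        for i in range(d_v)
    ]
    return new_state, new_momentum


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    s, m = momentum_delta_step(zeros, zeros, [1.0, 0.0], [2.0, 3.0])
    print(s)  # memory now maps the key [1, 0] to the value [2, 3]
```

With `mu=0` this reduces to the plain delta rule. With `mu>0` the previous gradient keeps contributing after the error has reached zero, which is the second-order behaviour the abstract analyzes when deriving stability constraints.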
CV Oct 11, 2023
SpikePoint: An Efficient Point-based Spiking Neural Network for Event Cameras Action Recognition
Hongwei Ren, Yue Zhou, Yulong Huang et al.
Event cameras are bio-inspired sensors that respond to local changes in light intensity and feature low latency, high energy efficiency, and high dynamic range. Meanwhile, Spiking Neural Networks (SNNs) have gained significant attention due to their remarkable efficiency and fault tolerance. By synergistically harnessing the energy efficiency inherent in event cameras and the spike-based processing capabilities of SNNs, their integration could enable ultra-low-power application scenarios, such as action recognition tasks. However, existing approaches often entail converting asynchronous events into conventional frames, leading to additional data mapping efforts and a loss of sparsity, contradicting the design concept of SNNs and event cameras. To address this challenge, we propose SpikePoint, a novel end-to-end point-based SNN architecture. SpikePoint excels at processing sparse event cloud data, effectively extracting both global and local features through a singular-stage structure. Leveraging the surrogate training method, SpikePoint achieves high accuracy with few parameters and maintains low power consumption, specifically employing the identity mapping feature extractor on diverse datasets. SpikePoint achieves state-of-the-art (SOTA) performance on four event-based action recognition datasets using only 16 timesteps, surpassing other SNN methods. Moreover, it also achieves SOTA performance across all methods on three datasets, utilizing approximately 0.3% of the parameters and 0.5% of power consumption employed by artificial neural networks (ANNs). These results emphasize the significance of Point Cloud and pave the way for many ultra-low-power event-based data processing applications.
CV Aug 19, 2023
TTPOINT: A Tensorized Point Cloud Network for Lightweight Action Recognition with Event Cameras
Hongwei Ren, Yue Zhou, Haotian Fu et al.
Event cameras have gained popularity in computer vision due to their data sparsity, high dynamic range, and low latency. As bio-inspired sensors, event cameras generate sparse and asynchronous data, which is inherently incompatible with the traditional frame-based method. Alternatively, the point-based method can avoid additional modality transformation and naturally adapt to the sparsity of events. Still, it typically cannot reach accuracy comparable to that of the frame-based method. We propose a lightweight and generalized point cloud network called TTPOINT, which achieves competitive results even compared to the state-of-the-art (SOTA) frame-based method in action recognition tasks while using only 1.5% of the computational resources. The model is adept at abstracting local and global geometry through a hierarchical structure. By leveraging tensor-train compressed feature extractors, TTPOINT can be designed with minimal parameters and computational complexity. Additionally, we developed a straightforward downsampling algorithm to maintain the spatio-temporal feature. In the experiments, TTPOINT emerged as the SOTA method on three datasets while also attaining SOTA among point cloud methods on all five datasets. Moreover, by using the tensor-train decomposition method, the accuracy of the proposed TTPOINT is almost unaffected while the parameter size is compressed by 55% on all five datasets.
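The parameter savings of a tensor-train (TT) factorized weight matrix follow from a simple count: a dense layer stores the product of all input and output mode sizes, whereas the TT format stores one small core per mode. The sketch below compares the two counts; the mode sizes and ranks in the example are hypothetical and are not TTPOINT's actual configuration.

```python
from math import prod


def dense_params(in_modes, out_modes):
    """Parameter count of a dense weight matrix of shape prod(in) x prod(out)."""
    return prod(in_modes) * prod(out_modes)


def tt_params(in_modes, out_modes, ranks):
    """Parameter count of the same matrix stored as tensor-train cores.

    Core k has shape (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]);
    the boundary ranks must both equal 1.
    """
    if not (len(in_modes) == len(out_modes) == len(ranks) - 1):
        raise ValueError("need one rank more than the number of modes")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )


if __name__ == "__main__":
    # Hypothetical 64 -> 64 layer factorized as 4 x 4 x 4 modes with TT rank 2.
    modes, ranks = (4, 4, 4), (1, 2, 2, 1)
    print(dense_params(modes, modes), tt_params(modes, modes, ranks))
```

In this hypothetical setting the dense layer needs 4096 parameters and the TT cores need 128. Larger ranks trade compression for expressiveness.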
91.8 CV May 3 Code
Exploring Data-Free LoRA Transferability for Video Diffusion Models
Yuchen Wang, Wenliang Zhong, Lichen Bai et al.
Video diffusion models leveraging step distillation or causal distillation have achieved remarkable performance. However, adapting existing LoRAs to these variants remains a critical challenge due to weight space mismatches. We observe that direct application leads to style degradation and structural collapse, yet the underlying mechanisms remain poorly understood. To fill this gap, we delve into the weight space and identify that the incompatibility stems from spectral interference within shared functional clusters defined over singular subspaces. Specifically, our analysis reveals that while both paradigms respect spectral rigidity, they establish conflicting routing pathways that clash through constructive overload or destructive cancellation. To address this issue, we propose Cluster-Aware Spectral Arbitration (CASA), a data-free framework that dynamically arbitrates between safeguarding the target's manifold and restoring LoRA alignment based on spectral density. Extensive experiments demonstrate that CASA effectively mitigates artifacts and revives LoRA functionality. Our code is available at https://github.com/Noahwangyuchen/CASA
CV Mar 31, 2025 Code
Exploring Temporal Dynamics in Event-based Eye Tracker
Hongwei Ren, Xiaopeng Lin, Hongxiang Huang et al.
Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vision systems, are capable of perceiving eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for achieving high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker utilizes 3D convolutional neural networks to capture implicit short-term temporal dynamics and employs a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics. Ultimately, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset and secured third place in the CVPR event-based eye-tracking challenge 2025. Our code is available at https://github.com/rhwxmx/TDTracker.
CV Feb 26
Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
Dian Xie, Shitong Shao, Lichen Bai et al.
Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which can significantly improve human preference scores in the conventional evaluation framework but does not actually work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work should strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
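For readers unfamiliar with the guidance scale discussed above, the standard CFG combination extrapolates from the unconditional noise prediction toward the conditional one. The sketch below shows only that textbook formula on plain Python lists; it is not the GA-Eval calibration procedure or TDG, and the function and argument names are assumptions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    Computes eps_uncond + scale * (eps_cond - eps_uncond) element-wise.
    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further along the conditional direction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    uncond, cond = [0.0, 1.0], [1.0, 3.0]
    for w in (0.0, 1.0, 2.0):
        print(w, classifier_free_guidance(uncond, cond, w))
```

Because a larger scale amplifies the difference between the two predictions, it strengthens prompt alignment, which is the effect the abstract identifies as inflating preference scores even when image quality degrades.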
CV Aug 18, 2025 Code
SIS-Challenge: Event-based Spatio-temporal Instance Segmentation Challenge at the CVPR 2025 Event-based Vision Workshop
Friedhelm Hamann, Emil Mededovic, Fabian Gülhan et al.
We present an overview of the Spatio-temporal Instance Segmentation (SIS) challenge held in conjunction with the CVPR 2025 Event-based Vision Workshop. The task is to predict accurate pixel-level segmentation masks of defined object classes from spatio-temporally aligned event camera and grayscale camera data. We provide an overview of the task, dataset, challenge details and results. Furthermore, we describe the methods used by the top-5 ranking teams in the challenge. More resources and code of the participants' methods are available here: https://github.com/tub-rip/MouseSIS/blob/main/docs/challenge_results.md
CV May 9, 2024 Code
Rethinking Efficient and Effective Point-based Networks for Event Camera Classification and Regression: EventMamba
Hongwei Ren, Yue Zhou, Jiadong Zhu et al.
Event cameras draw inspiration from biological systems, boasting low latency and high dynamic range while consuming minimal power. The prevailing approach to processing Event Cloud often involves converting it into frame-based representations, which neglects the sparsity of events, loses fine-grained temporal information, and increases the computational burden. In contrast, Point Cloud is a popular representation for processing 3-dimensional data and serves as an alternative method to exploit local and global spatial features. Nevertheless, previous point-based methods show unsatisfactory performance compared to the frame-based method in dealing with spatio-temporal event streams. To bridge the gap, we propose EventMamba, an efficient and effective framework based on Point Cloud representation that rethinks the distinction between Event Cloud and Point Cloud and emphasizes vital temporal information. The Event Cloud is subsequently fed into a hierarchical structure with staged modules to process both implicit and explicit temporal features. Specifically, we redesign the global extractor to enhance explicit temporal extraction among a long sequence of events with temporal aggregation and State Space Model (SSM) based Mamba. Our model consumes minimal computational resources in the experiments and still exhibits SOTA point-based performance on six action recognition datasets of different scales. It even outperforms all frame-based methods on both Camera Pose Relocalization (CPR) and eye-tracking regression tasks. Our code is available at: https://github.com/rhwxmx/EventMamba.
CV Apr 17, 2024
Event-Based Eye Tracking. AIS 2024 Challenge Survey
Zuowen Wang, Chang Gao, Zongwei Wu et al.
This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The task of the challenge focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve a good trade-off between task accuracy and efficiency. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research.
CV Mar 28, 2024
A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization
Hongwei Ren, Jiadong Zhu, Yue Zhou et al.
Event cameras exhibit remarkable attributes such as high dynamic range, asynchronicity, and low latency, making them highly suitable for vision tasks that involve high-speed motion in challenging lighting conditions. These cameras implicitly capture movement and depth information in events, making them appealing sensors for Camera Pose Relocalization (CPR) tasks. Nevertheless, existing CPR networks based on events neglect the pivotal fine-grained temporal information in events, resulting in unsatisfactory performance. Moreover, the energy-efficient features are further compromised by the use of excessively complex models, hindering efficient deployment on edge devices. In this paper, we introduce PEPNet, a simple and effective point-based network designed to regress six degrees of freedom (6-DOFs) event camera poses. We rethink the relationship between the event camera and CPR tasks, leveraging the raw Point Cloud directly as network input to harness the high-temporal resolution and inherent sparsity of events. PEPNet is adept at abstracting the spatial and implicit temporal features through hierarchical structure and explicit temporal features by Attentive Bi-directional Long Short-Term Memory (A-Bi-LSTM). By employing a carefully crafted lightweight design, PEPNet delivers state-of-the-art (SOTA) performance on both indoor and outdoor datasets with meager computational resources. Specifically, PEPNet attains a significant 38% and 33% performance improvement on the random split IJRR and M3ED datasets, respectively. Moreover, the lightweight design version PEPNet$_{tiny}$ accomplishes results comparable to the SOTA while employing a mere 0.5% of the parameters.
CV Apr 25, 2025
Event-Based Eye Tracking. 2025 Event-based Vision Workshop
Qinyu Chen, Chang Gao, Min Liu et al.
This survey serves as a review of the 2025 Event-Based Eye Tracking Challenge, organized as part of the 2025 CVPR event-based vision workshop. The challenge focuses on the task of predicting the pupil center by processing eye movement recorded with an event camera. We review and summarize the innovative methods from the top-ranked teams in the challenge to advance future event-based eye tracking research. For each method, accuracy, model size, and number of operations are reported. In this survey, we also discuss event-based eye tracking from the perspective of hardware design.
CV Jan 27, 2025
ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring
Xiaopeng Lin, Yulong Huang, Hongwei Ren et al.
Motion deblurring addresses the challenge of image blur caused by camera or scene movement. Event cameras provide motion information that is encoded in the asynchronous event streams. To efficiently leverage the temporal information of event streams, we employ Spiking Neural Networks (SNNs) for motion feature extraction and Artificial Neural Networks (ANNs) for color information processing. Due to the non-uniform distribution and inherent redundancy of event data, existing cross-modal feature fusion methods exhibit certain limitations. Inspired by the visual attention mechanism in the human visual system, this study introduces a bioinspired dual-drive hybrid network (BDHNet). Specifically, the Neuron Configurator Module (NCM) is designed to dynamically adjust neuron configurations based on cross-modal features, thereby focusing the spikes in blurry regions and adapting to varying blurry scenarios. Additionally, the Region of Blurry Attention Module (RBAM) is introduced to generate a blurry mask in an unsupervised manner, effectively extracting motion clues from the event features and guiding more accurate cross-modal feature fusion. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art methods on both synthetic and real-world datasets.
CV Dec 30, 2024
Frequency-aware Event Cloud Network
Hongwei Ren, Fei Ma, Xiaopeng Lin et al.
Event cameras are biologically inspired sensors that emit events asynchronously with remarkable temporal resolution, garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach satisfactory performance but introduce time-consuming transformations and bulky models and sacrifice fine-grained temporal information. Alternatively, Point Cloud representation demonstrates promise in addressing the mentioned weaknesses, but it ignores the polarity information, and its models have limited proficiency in abstracting long-term events' features. In this paper, we propose a frequency-aware network named FECNet that leverages Event Cloud representations. FECNet fully utilizes 2S-1T-1P Event Cloud by innovating the event-based Group and Sampling module. To accommodate the long sequence events from Event Cloud, FECNet embraces feature extraction in the frequency domain via the Fourier transform. This approach substantially curbs the explosion of Multiply Accumulate Operations (MACs) while effectively abstracting spatial-temporal features. We conducted extensive experiments on event-based object classification, action recognition, and human pose estimation tasks, and the results substantiate the effectiveness and efficiency of FECNet.
CV Dec 16, 2024
Event-based Motion Deblurring via Multi-Temporal Granularity Fusion
Xiaopeng Lin, Hongwei Ren, Yulong Huang et al.
Conventional frame-based cameras inevitably produce blurry effects due to motion occurring during the exposure time. The event camera, a bio-inspired sensor offering continuous visual information, could enhance deblurring performance. Effectively utilizing the high-temporal-resolution event data is crucial for extracting precise motion information. However, existing event-based image deblurring methods usually utilize voxel-based event representations, losing the fine-grained temporal details that are mathematically essential for fast motion deblurring. In this paper, we first introduce point cloud-based event representation into the image deblurring task and propose a Multi-Temporal Granularity Network (MTGNet). It combines the spatially dense but temporally coarse-grained voxel-based event representation and the temporally fine-grained but spatially sparse point cloud-based event representation. To seamlessly integrate such complementary representations, we design a Fine-grained Point Branch. An Aggregation and Mapping Module (AMM) is proposed to align the low-level point-based features with frame-based features, and an Adaptive Feature Diffusion Module (AFDM) is designed to manage the resolution discrepancies between event data and image data by enriching the sparse point feature. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art approaches on both synthetic and real-world datasets.
CV Jun 5, 2024
FAPNet: An Effective Frequency Adaptive Point-based Eye Tracker
Xiaopeng Lin, Hongwei Ren, Bojun Cheng
Eye tracking is crucial for human-computer interaction in different domains. Conventional cameras encounter challenges such as power consumption and image quality during different eye movements, prompting the need for advanced solutions with ultra-fast, low-power, and accurate eye trackers. Event cameras, fundamentally designed to capture information about moving objects, exhibit low power consumption and high temporal resolution. This positions them as an alternative to traditional cameras in the realm of eye tracking. Nevertheless, existing event-based eye tracking networks neglect the pivotal sparse and fine-grained temporal information in events, resulting in unsatisfactory performance. Moreover, the energy-efficient features are further compromised by the use of excessively complex models, hindering efficient deployment on edge devices. In this paper, we utilize Point Cloud as the event representation to harness the high temporal resolution and sparse characteristics of events in eye tracking tasks. We rethink the point-based architecture PEPNet by preprocessing the long-term relationships between samples, leading to the design of FAPNet. A frequency adaptive mechanism is designed to realize adaptive tracking according to the speed of the pupil movement, and the Inter Sample LSTM module is introduced to utilize the temporal correlation between samples. In the Event-based Eye Tracking Challenge, we use vanilla PEPNet, our earlier work, to achieve a $p_{10}$ accuracy of 97.95%. On the SEET synthetic dataset, FAPNet achieves state-of-the-art performance while consuming merely 10% of PEPNet's computational resources. Notably, the computational demand of FAPNet is independent of the sensor's spatial resolution, enhancing its applicability on resource-limited edge devices.
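The $p_{10}$ figure quoted above can be read as a tolerance-based accuracy. The sketch below assumes that $p_{10}$ means the fraction of predicted pupil centers lying within 10 pixels (Euclidean distance, inclusive) of the ground truth; that definition, the function name, and the inclusive boundary are assumptions rather than details stated in the abstract.

```python
from math import hypot


def p_accuracy(predictions, targets, tolerance=10.0):
    """Fraction of predictions within `tolerance` pixels of their targets.

    predictions, targets: equal-length sequences of (x, y) pixel coordinates.
    """
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(predictions, targets)
        if hypot(px - tx, py - ty) <= tolerance
    )
    return hits / len(predictions)


if __name__ == "__main__":
    preds = [(0.0, 0.0), (3.0, 4.0), (20.0, 0.0)]
    truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
    print(p_accuracy(preds, truth))       # two of three within 10 pixels
    print(p_accuracy(preds, truth, 4.0))  # one of three within 4 pixels
```

Tightening the tolerance gives stricter variants of the same metric, which is useful when comparing trackers whose errors cluster just inside the 10-pixel radius.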