MMAug 12, 2024
Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live StreamingXinqi Jin, Zhui Zhu, Xikai Sun et al.
Neural enhancement through super-resolution (SR) deep neural networks (DNNs) opens up new possibilities for ultra-high-definition (UHD) live streaming over existing encoding and networking infrastructure. Yet, the heavy SR DNN inference overhead leads to severe deployment challenges. To reduce the overhead, existing systems propose to apply DNN-based SR only on carefully selected anchor frames while upscaling non-anchor frames via the lightweight reusing-based SR approach. However, frame-level scheduling is coarse-grained and fails to deliver optimal efficiency. In this work, we propose Palantir, the first neural-enhanced UHD live streaming system with fine-grained patch-level scheduling. Two novel techniques are incorporated into Palantir to select the most beneficial anchor patches and support latency-sensitive UHD live streaming applications. Firstly, under the guidance of our pioneering and theoretical analysis, Palantir constructs a directed acyclic graph (DAG) for lightweight yet accurate SR quality estimation under any possible anchor patch set. Secondly, to further optimize the scheduling latency, Palantir improves parallelizability by refactoring the computation subprocedure of the estimation process into a sparse matrix-matrix multiplication operation. The evaluation results suggest that Palantir incurs a negligible scheduling latency accounting for less than 5.7% of the end-to-end latency requirement. When compared to the naive method of applying DNN-based SR on all the frames, Palantir can reduce the SR DNN inference overhead by 20 times (or 60 times) while preserving 54.0-82.6% (or 32.8-64.0%) of the quality gain. When compared to the state-of-the-art real-time frame-level scheduling strategy, Palantir can reduce the SR DNN inference overhead by 80.1% at most (and 38.4% on average) without sacrificing the video quality.
CVMay 19
EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial ReasoningPengtao Ma, Ziliang Zhou, Ciyu Ruan et al.
First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.
ROMar 29, 2025Code
Event Camera Meets Resource-Aware Mobile Computing: Abstraction, Algorithm, Acceleration, ApplicationHaoyang Wang, Ruishan Guo, Pengtao Ma et al.
With the increasing complexity of mobile device applications, these devices are evolving toward high agility. This shift imposes new demands on mobile sensing, particularly in achieving high-accuracy and low-latency. Event-based vision has emerged as a disruptive paradigm, offering high temporal resolution and low latency, making it well-suited for high-accuracy and low-latency sensing tasks on high-agility platforms. However, the presence of substantial noisy events, lack of stable, persistent semantic information, and large data volume pose challenges for event-based data processing on resource-constrained mobile devices. This paper surveys the literature from 2014 to 2025 and presents a comprehensive overview of event-based mobile sensing, encompassing its fundamental principles, event \textit{abstraction} methods, \textit{algorithm} advancements, and both hardware and software \textit{acceleration} strategies. We discuss key \textit{applications} of event cameras in mobile sensing, including visual odometry, object tracking, optical flow, and 3D reconstruction, while highlighting challenges associated with event data processing, sensor fusion, and real-time deployment. Furthermore, we outline future research directions, such as improving the event camera with advanced optics, leveraging neuromorphic computing for efficient processing, and integrating bio-inspired algorithms. To support ongoing research, we provide an open-source \textit{Online Sheet} with recent developments. We hope this survey serves as a reference, facilitating the adoption of event-based vision across diverse applications.
CVMar 17
$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge SpaceRuishan Guo, Ciyu Ruan, Haoyang Wang et al.
Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex.Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation.Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.
LGApr 14, 2024
RF-Diffusion: Radio Signal Generation via Time-Frequency DiffusionGuoxuan Chi, Zheng Yang, Chenshu Wu et al. · tsinghua
Along with AIGC shines in CV and NLP, its potential in the wireless domain has also emerged in recent years. Yet, existing RF-oriented generative solutions are ill-suited for generating high-quality, time-series RF data due to limited representation capabilities. In this work, inspired by the stellar achievements of the diffusion model in CV and NLP, we adapt it to the RF domain and propose RF-Diffusion. To accommodate the unique characteristics of RF signals, we first introduce a novel Time-Frequency Diffusion theory to enhance the original diffusion model, enabling it to tap into the information within the time, frequency, and complex-valued domains of RF signals. On this basis, we propose a Hierarchical Diffusion Transformer to translate the theory into a practical generative DNN through elaborated design spanning network architecture, functional block, and complex-valued operator, making RF-Diffusion a versatile solution to generate diverse, high-quality, and time-series RF data. Performance comparison with three prevalent generative models demonstrates the RF-Diffusion's superior performance in synthesizing Wi-Fi and FMCW signals. We also showcase the versatility of RF-Diffusion in boosting Wi-Fi sensing systems and performing channel estimation in 5G networks.
ROFeb 20, 2025
Ultra-High-Frequency Harmony: mmWave Radar and Event Camera Orchestrate Accurate Drone LandingHaoyang Wang, Jingao Xu, Xinyu Luo et al.
For precise, efficient, and safe drone landings, ground platforms should real-time, accurately locate descending drones and guide them to designated spots. While mmWave sensing combined with cameras improves localization accuracy, the lower sampling frequency of traditional frame cameras compared to mmWave radar creates bottlenecks in system throughput. In this work, we replace the traditional frame camera with event camera, a novel sensor that harmonizes in sampling frequency with mmWave radar within the ground platform setup, and introduce mmE-Loc, a high-precision, low-latency ground localization system designed for drone landings. To fully leverage the \textit{temporal consistency} and \textit{spatial complementarity} between these modalities, we propose two innovative modules, \textit{consistency-instructed collaborative tracking} and \textit{graph-informed adaptive joint optimization}, for accurate drone measurement extraction and efficient sensor fusion. Extensive real-world experiments in landing scenarios from a leading drone delivery company demonstrate that mmE-Loc outperforms state-of-the-art methods in both localization accuracy and latency.
CVMay 8, 2025
PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera DerainingCiyu Ruan, Ruishan Guo, Zihang Gong et al.
Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw event and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions.
CVMay 8, 2025
EDmamba: Rethinking Efficient Event Denoising with Spatiotemporal Decoupled SSMsCiyu Ruan, Zihang Gong, Ruishan Guo et al.
Event cameras provide micro-second latency and broad dynamic range, yet their raw streams are marred by spatial artifacts (e.g., hot pixels) and temporally inconsistent background activity. Existing methods jointly process the entire 4D event volume (x, y, p, t), forcing heavy spatio-temporal attention that inflates parameters, FLOPs, and latency. We introduce EDmamba, a compact event-denoising framework that embraces the key insight that spatial and temporal noise arise from different physical mechanisms and can therefore be suppressed independently. A polarity- and geometry-aware encoder first extracts coarse cues, which are then routed to two lightweight state-space branches: a Spatial-SSM that learns location-conditioned filters to silence persistent artifacts, and a Temporal-SSM that models causal signal dynamics to eliminate bursty background events. This decoupled design distills the network to only 88.9K parameters and 2.27GFLOPs, enabling real-time throughput of 100K events in 68ms on a single GPU, 36x faster than recent Transformer baselines. Despite its economy, EDmamba establishes new state-of-the-art accuracy on four public benchmarks, outscoring the strongest prior model by 2.1 percentage points.
CLOct 17, 2025
Exemplar-Guided Planing: Enhanced LLM Agent for KGQAJingao Xu, Shuoyoucheng Ma, Xin Song et al.
Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM's planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.
CVJul 5, 2025
VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial SupervisionKezhong Liu, Yiwen Zhou, Mozi Chen et al.
This work proposes a mmWave radar's scene flow estimation framework supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are more accessible, visual images alone cannot capture the 3D motions of moving objects, making it difficult to supervise their scene flow. Moreover, the temporal drift of VI rigid transformation also degenerates the scene flow estimation of static points. To address these challenges, we propose a drift-free rigid transformation estimator that fuses kinematic model-based ego-motions with neural network-learned results. It provides strong supervision signals to radar-based rigid transformation and infers the scene flow of static points. Then, we develop an optical-mmWave supervision extraction module that extracts the supervision signals of radar rigid transformation and scene flow. It strengthens the supervision by learning the scene flow of dynamic points with the joint constraints of optical and mmWave radar measurements. Extensive experiments demonstrate that, in smoke-filled environments, our method even outperforms state-of-the-art (SOTA) approaches using costly LiDARs.
SEJul 1, 2025
iPanda: An LLM-based Agent for Automated Conformance Testing of Communication ProtocolsXikai Sun, Fan Dang, Shiqi Jiang et al.
Conformance testing is essential for ensuring that protocol implementations comply with their specifications. However, traditional testing approaches involve manually creating numerous test cases and scripts, making the process labor-intensive and inefficient. Recently, Large Language Models (LLMs) have demonstrated impressive text comprehension and code generation abilities, providing promising opportunities for automation. In this paper, we propose iPanda, the first framework that leverages LLMs to automate protocol conformance testing. Given a protocol specification document and its implementation, iPanda first employs a keyword-based method to automatically generate comprehensive test cases. Then, it utilizes retrieval-augmented generation and customized CoT strategy to effectively interpret the implementation and produce executable test programs. To further enhance programs' quality, iPanda incorporates an iterative optimization mechanism to refine generated test scripts interactively. Finally, by executing and analyzing the generated tests, iPanda systematically verifies compliance between implementations and protocol specifications. Comprehensive experiments on various protocols show that iPanda significantly outperforms pure LLM-based approaches, improving the success rate (Pass@1) of test-program generation by factors ranging from 4.675 times to 10.751 times.