IVNov 7, 2022
Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: ReportAndrey Ignatov, Radu Timofte, Maurizio Denna et al.
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose the participants to design an efficient quantized image super-resolution solution that can demonstrate a real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
IVNov 7, 2022
Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: ReportAndrey Ignatov, Radu Timofte, Cheng-Ming Chiang et al.
Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.
NESep 20, 2023Code
SpikingNeRF: Making Bio-inspired Neural Networks See through the Real WorldXingting Yao, Qinghao Hu, Fei Zhou et al.
In this paper, we propose SpikingNeRF, which aligns the temporal dimension of spiking neural networks (SNNs) with the radiance rays, to seamlessly accommodate SNNs to the reconstruction of neural radiance fields (NeRF). Thus, the computation turns into a spike-based, multiplication-free manner, reducing energy consumption and making high-quality 3D rendering, for the first time, accessible to neuromorphic hardware. In SpikingNeRF, each sampled point on the ray is matched to a particular time step and represented in a hybrid manner where the voxel grids are maintained as well. Based on the voxel grids, sampled points are determined whether to be masked out for faster training and inference. However, this masking operation also incurs irregular temporal length, making it intractable for hardware processors, e.g., GPUs, to conduct parallel training. To address this problem, we develop the temporal padding strategy to tackle the masked samples to maintain regular temporal length, i.e., regular tensors, and further propose the temporal condensing strategy to form a denser data structure for hardware-friendly computation. Experiments on various datasets demonstrate that our method can reduce energy consumption by an average of 70.79\% and obtain comparable synthesis quality with the ANN baseline. Verification on the neuromorphic hardware accelerator also shows that SpikingNeRF can further benefit from neuromorphic computing over the ANN baselines on energy efficiency. Codes and the appendix are in \url{https://github.com/Ikarosy/SpikingNeRF-of-CASIA}.
LGJan 30
TriSpec: Ternary Speculative Decoding via Lightweight Proxy VerificationHaoyun Jiang, Junqi He, Feng Hong et al.
Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce verification costs, achieving greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35\% speedup over standard SD, with up to 50\% fewer target model invocations while maintaining comparable accuracy.
CVDec 15, 2025
VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language InferenceShengling Qin, Hao Yu, Chenxin Wu et al.
This paper presents VLCache, a cache reuse framework that exploits both Key-Value (KV) cache and encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to minimize the non-prefix cache reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves an accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups. We develop an experimental implementation of the proposed VLCache pipeline based on SGLang, enabling significantly faster inference in practical deployments.