ARMar 28Code
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUsJinwu Yang, Jiaan Wu, Zedong Liu et al.
The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei's Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method specifically customized for AI model weights and optimized for Ascend Neural Processing Units. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation. Experimental results demonstrate that ENEC outperforms existing state-of-the-art NPU compressors in both compression ratio and throughput. Compared to leading GPU solutions, ENEC achieves a 3.43X higher throughput than DietGPU and a 1.12X better compression ratio than nvCOMP. By reducing weight transmission overhead, ENEC significantly improves end-to-end inference performance, achieving up to a 6.3X speedup. On Ascend NPUs, ENEC is the first open-source lossless compression algorithm for model weights that achieves performance comparable to state-of-the-art GPU compressors, offering an effective solution for deploying large-scale AI models.
CVNov 23, 2023
Progressive Learning with Visual Prompt Tuning for Variable-Rate Image CompressionShiyu Qin, Yimin Zhou, Jinpeng Wang et al.
In this paper, we propose a progressive learning paradigm for transformer-based variable-rate image compression. Our approach covers a wide range of compression rates with the assistance of the Layer-adaptive Prompt Module (LPM). Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively, which are fed as additional information into the Swin Transformer layer of a pre-trained transformer-based image compression model to affect the allocation of attention region and the bits, which in turn changes the target compression ratio of the model. To ensure the network is more lightweight, we involves the integration of prompt networks with less convolutional layers. Exhaustive experiments show that compared to methods based on multiple models, which are optimized separately for different target rates, the proposed method arrives at the same performance with 80% savings in parameter storage and 90% savings in datasets. Meanwhile, our model outperforms all current variable bitrate image methods in terms of rate-distortion performance and approaches the state-of-the-art fixed bitrate image compression methods trained from scratch.
CVNov 23, 2023
Perceptual Image Compression with Cooperative Cross-Modal Side InformationShiyu Qin, Bin Chen, Yujun Huang et al.
The explosion of data has resulted in more and more associated text being transmitted along with images. Inspired by from distributed source coding, many works utilize image side information to enhance image compression. However, existing methods generally do not consider using text as side information to enhance perceptual compression of images, even though the benefits of multimodal synergy have been widely demonstrated in research. This begs the following question: How can we effectively transfer text-level semantic dependencies to help image compression, which is only available to the decoder? In this work, we propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff. Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features. This is done by predicting a semantic mask to guide the learned text-adaptive affine transformation at the pixel level. Furthermore, we design a text-conditional generative adversarial networks to improve the perceptual quality of reconstructed images. Extensive experiments involving four datasets and ten image quality assessment metrics demonstrate that the proposed approach achieves superior results in terms of rate-perception trade-off and semantic distortion.
IVMay 24, 2024Code
MambaVC: Learned Visual Compression with Selective State SpacesShiyu Qin, Jinpeng Wang, Yimin Zhou et al.
Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, where balancing efficacy (i.e., rate-distortion trade-off) and efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, we take the first step to explore SSMs for visual compression. We introduce MambaVC, a simple, strong and efficient compression network based on SSM. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, which helps to capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC shows even greater improvements with high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages. Code is available at https://github.com/QinSY123/2024-MambaVC.
CVSep 6, 2024
3D-LMVIC: Learning-based Multi-View Image Coding with 3D Gaussian Geometric PriorsYujun Huang, Bin Chen, Niu Lian et al.
Existing multi-view image compression methods often rely on 2D projection-based similarities between views to estimate disparities. While effective for small disparities, such as those in stereo images, these methods struggle with the more complex disparities encountered in wide-baseline multi-camera systems, commonly found in virtual reality and autonomous driving applications. To address this limitation, we propose 3D-LMVIC, a novel learning-based multi-view image compression framework that leverages 3D Gaussian Splatting to derive geometric priors for accurate disparity estimation. Furthermore, we introduce a depth map compression model to minimize geometric redundancy across views, along with a multi-view sequence ordering strategy based on a defined distance measure between views to enhance correlations between adjacent views. Experimental results demonstrate that 3D-LMVIC achieves superior performance compared to both traditional and learning-based methods. Additionally, it significantly improves disparity estimation accuracy over existing two-view approaches.
CVDec 2, 2025
InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent CollaborationZhongyu Yang, Yingfang Yuan, Xuanming Jiang et al.
Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
IVJan 23, 2024
An Efficient Implicit Neural Representation Image Codec Based on Mixed Autoregressive Model for Low-Complexity DecodingXiang Liu, Jiahong Chen, Bin Chen et al.
Displaying high-quality images on edge devices, such as augmented reality devices, is essential for enhancing the user experience. However, these devices often face power consumption and computing resource limitations, making it challenging to apply many deep learning-based image compression algorithms in this field. Implicit Neural Representation (INR) for image compression is an emerging technology that offers two key benefits compared to cutting-edge autoencoder models: low computational complexity and parameter-free decoding. It also outperforms many traditional and early neural compression methods in terms of quality. In this study, we introduce a new Mixed AutoRegressive Model (MARM) to significantly reduce the decoding time for the current INR codec, along with a new synthesis network to enhance reconstruction quality. MARM includes our proposed AutoRegressive Upsampler (ARU) blocks, which are highly computationally efficient, and ARM from previous work to balance decoding time and reconstruction quality. We also propose enhancing ARU's performance using a checkerboard two-stage decoding strategy. Moreover, the ratio of different modules can be adjusted to maintain a balance between quality and speed. Comprehensive experiments demonstrate that our method significantly improves computational efficiency while preserving image quality. With different parameter settings, our method can achieve over a magnitude acceleration in decoding time without industrial level optimization, or achieve state-of-the-art reconstruction quality compared with other INR codecs. To the best of our knowledge, our method is the first INR-based codec comparable with Hyperprior in both decoding speed and quality while maintaining low complexity.
CVMay 15, 2025
High Quality Underwater Image Compression with Adaptive Correction and Codebook-based AugmentationYimin Zhou, Yichong Xia, Sicheng Pan et al.
With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.
CVJan 19, 2024
MB-RACS: Measurement-Bounds-based Rate-Adaptive Image Compressed Sensing NetworkYujun Huang, Bin Chen, Naiqi Li et al.
Conventional compressed sensing (CS) algorithms typically apply a uniform sampling rate to different image blocks. A more strategic approach could be to allocate the number of measurements adaptively, based on each image block's complexity. In this paper, we propose a Measurement-Bounds-based Rate-Adaptive Image Compressed Sensing Network (MB-RACS) framework, which aims to adaptively determine the sampling rate for each image block in accordance with traditional measurement bounds theory. Moreover, since in real-world scenarios statistical information about the original image cannot be directly obtained, we suggest a multi-stage rate-adaptive sampling strategy. This strategy sequentially adjusts the sampling ratio allocation based on the information gathered from previous samplings. We formulate the multi-stage rate-adaptive sampling as a convex optimization problem and address it using a combination of Newton's method and binary search techniques. Additionally, we enhance our decoding process by incorporating skip connections between successive iterations to facilitate a richer transmission of feature information across iterations. Our experiments demonstrate that the proposed MB-RACS method surpasses current leading methods, with experimental evidence also underscoring the effectiveness of each module within our proposed framework.