CVSep 6, 2023
Bandwidth-efficient Inference for Neural Image CompressionShanzhi Yin, Tongda Xu, Yongsheng Liang et al.
With neural networks growing deeper and feature maps growing larger, limited communication bandwidth with external memory (or DRAM) and power constraints become a bottleneck in implementing network inference on mobile and edge devices. In this paper, we propose an end-to-end differentiable bandwidth efficient neural inference method with the activation compressed by neural data compression method. Specifically, we propose a transform-quantization-entropy coding pipeline for activation compression with symmetric exponential Golomb coding and a data-dependent Gaussian entropy model for arithmetic coding. Optimized with existing model quantization methods, low-level task of image compression can achieve up to 19x bandwidth reduction with 6.21x energy saving.
IVOct 11, 2024Code
Beyond GFVC: A Progressive Face Video Compression Framework with Adaptive Visual TokensBolin Chen, Shanzhi Yin, Zihan Zhang et al.
Recently, deep generative models have greatly advanced the progress of face video coding towards promising rate-distortion performance and diverse application functionalities. Beyond traditional hybrid video coding paradigms, Generative Face Video Compression (GFVC) relying on the strong capabilities of deep generative models and the philosophy of early Model-Based Coding (MBC) can facilitate the compact representation and realistic reconstruction of visual face signal, thus achieving ultra-low bitrate face video communication. However, these GFVC algorithms are sometimes faced with unstable reconstruction quality and limited bitrate ranges. To address these problems, this paper proposes a novel Progressive Face Video Compression framework, namely PFVC, that utilizes adaptive visual tokens to realize exceptional trade-offs between reconstruction robustness and bandwidth intelligence. In particular, the encoder of the proposed PFVC projects the high-dimensional face signal into adaptive visual tokens in a progressive manner, whilst the decoder can further reconstruct these adaptive visual tokens for motion estimation and signal synthesis with different granularity levels. Experimental results demonstrate that the proposed PFVC framework can achieve better coding flexibility and superior rate-distortion performance in comparison with the latest Versatile Video Coding (VVC) codec and the state-of-the-art GFVC algorithms. The project page can be found at https://github.com/Berlin0610/PFVC.
IVOct 14, 2024Code
Generative Human Video Compression with Multi-granularity Temporal Trajectory FactorizationShanzhi Yin, Bolin Chen, Shiqi Wang et al.
In this paper, we propose a novel Multi-granularity Temporal Trajectory Factorization framework for generative human video compression, which holds great potential for bandwidth-constrained human-centric video communication. In particular, the proposed motion factorization strategy can facilitate to implicitly characterize the high-dimensional visual signal into compact motion vectors for representation compactness and further transform these vectors into a fine-grained field for motion expressibility. As such, the coded bit-stream can be entailed with enough visual motion information at the lowest representation cost. Meanwhile, a resolution-expandable generative module is developed with enhanced background stability, such that the proposed framework can be optimized towards higher reconstruction robustness and more flexible resolution adaptation. Experimental results show that proposed method outperforms latest generative models and the state-of-the-art video coding standard Versatile Video Coding (VVC) on both talking-face videos and moving-body videos in terms of both objective and subjective quality. The project page can be found at https://github.com/xyzysz/Extreme-Human-Video-Compression-with-MTTF.
CVOct 13, 2024Code
Tokenizing Motion: A Generative Approach for Scene Dynamics CompressionShanzhi Yin, Zihan Zhang, Bolin Chen et al.
This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion-performance and outperform the state-of-the-art conventional-video codec Enhanced Compression Model (ECM) on-scene dynamics sequences. The project page can be found at-https://github.com/xyzysz/GNVDC.
CVFeb 3, 2024
Generative Visual Compression: A ReviewBolin Chen, Shanzhi Yin, Peilin Chen et al.
Artificial Intelligence Generated Content (AIGC) is leading a new technical revolution for the acquisition of digital content and impelling the progress of visual compression towards competitive performance gains and diverse functionalities over traditional codecs. This paper provides a thorough review on the recent advances of generative visual compression, illustrating great potentials and promising applications in ultra-low bitrate communication, user-specified reconstruction/filtering, and intelligent machine analysis. In particular, we review the visual data compression methodologies with deep generative models, and summarize how compact representation and high-fidelity reconstruction could be actualized via generative techniques. In addition, we generalize related generative compression technologies for machine vision and intelligent analytics. Finally, we discuss the fundamental challenges on generative visual compression techniques and envision their future research directions.
CVOct 19, 2024
Standardizing Generative Face Video Compression using Supplemental Enhancement InformationBolin Chen, Yan Ye, Jie Chen et al.
This paper proposes a Generative Face Video Compression (GFVC) approach using Supplemental Enhancement Information (SEI), where a series of compact spatial and temporal representations of a face video signal (e.g., 2D/3D keypoints, facial semantics and compact features) can be coded using SEI messages and inserted into the coded video bitstream. At the time of writing, the proposed GFVC approach using SEI messages has been included into a draft amendment of the Versatile Supplemental Enhancement Information (VSEI) standard by the Joint Video Experts Team (JVET) of ISO/IEC JTC 1/SC 29 and ITU-T SG21, which will be standardized as a new version of ITU-T H.274 | ISO/IEC 23002-7. To the best of the authors' knowledge, the JVET work on the proposed SEI-based GFVC approach is the first standardization activity for generative video compression. The proposed SEI approach has not only advanced the reconstruction quality of early-day Model-Based Coding (MBC) via the state-of-the-art generative technique, but also established a new SEI definition for future GFVC applications and deployment. Experimental results illustrate that the proposed SEI-based GFVC approach can achieve remarkable rate-distortion performance compared with the latest Versatile Video Coding (VVC) standard, whilst also potentially enabling a wide variety of functionalities including user-specified animation/filtering and metaverse-related applications.
CVJun 9, 2025
Generative Models at the Frontier of Compression: A Survey on Generative Face Video CodingBolin Chen, Shanzhi Yin, Goluck Konuko et al.
The rise of deep generative models has greatly advanced video compression, reshaping the paradigm of face video coding through their powerful capability for semantic-aware representation and lifelike synthesis. Generative Face Video Coding (GFVC) stands at the forefront of this revolution, which could characterize complex facial dynamics into compact latent codes for bitstream compactness at the encoder side and leverages powerful deep generative models to reconstruct high-fidelity face signal from the compressed latent codes at the decoder side. As such, this well-designed GFVC paradigm could enable high-fidelity face video communication at ultra-low bitrate ranges, far surpassing the capabilities of the latest Versatile Video Coding (VVC) standard. To pioneer foundational research and accelerate the evolution of GFVC, this paper presents the first comprehensive survey of GFVC technologies, systematically bridging critical gaps between theoretical innovation and industrial standardization. In particular, we first review a broad range of existing GFVC methods with different feature representations and optimization strategies, and conduct a thorough benchmarking analysis. In addition, we construct a large-scale GFVC-compressed face video database with subjective Mean Opinion Scores (MOSs) based on human perception, aiming to identify the most appropriate quality metrics tailored to GFVC. Moreover, we summarize the GFVC standardization potentials with a unified high-level syntax and develop a low-complexity GFVC system which are both expected to push forward future practical deployments and applications. Finally, we envision the potential of GFVC in industrial applications and deliberate on the current challenges and future opportunities.
CVFeb 24, 2025
Pleno-Generation: A Scalable Generative Face Video Compression Framework with Bandwidth IntelligenceBolin Chen, Hanwei Zhu, Shanzhi Yin et al.
Generative model based compact video compression is typically operated within a relative narrow range of bitrates, and often with an emphasis on ultra-low rate applications. There has been an increasing consensus in the video communication industry that full bitrate coverage should be enabled by generative coding. However, this is an extremely difficult task, largely because generation and compression, although related, have distinct goals and trade-offs. The proposed Pleno-Generation (PGen) framework distinguishes itself through its exceptional capabilities in ensuring the robustness of video coding by utilizing a wider range of bandwidth for generation via bandwidth intelligence. In particular, we initiate our research of PGen with face video coding, and PGen offers a paradigm shift that prioritizes high-fidelity reconstruction over pursuing compact bitstream. The novel PGen framework leverages scalable representation and layered reconstruction for Generative Face Video Compression (GFVC), in an attempt to imbue the bitstream with intelligence in different granularity. Experimental results illustrate that the proposed PGen framework can facilitate existing GFVC algorithms to better deliver high-fidelity and faithful face videos. In addition, the proposed framework can allow a greater space of flexibility for coding applications and show superior RD performance with a much wider bitrate range in terms of various quality evaluations. Moreover, in comparison with the latest Versatile Video Coding (VVC) codec, the proposed scheme achieves competitive Bjøntegaard-delta-rate savings for perceptual-level evaluations.
IVMay 22, 2025
Compressing Human Body Video with Interactive Semantics: A Generative ApproachBolin Chen, Shanzhi Yin, Hanwei Zhu et al.
In this paper, we propose to compress human body video with interactive semantics, which can facilitate video coding to be interactive and controllable by manipulating semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle nonlinear dynamics and complex motion of human body signal into a series of configurable embeddings, which are controllably edited, compactly compressed, and efficiently transmitted. Moreover, the proposed decoder can evolve the mesh-based motion fields from these decoded semantics to realize the high-quality human body video reconstruction. Experimental results illustrate that the proposed framework can achieve promising compression performance for human body videos at ultra-low bitrate ranges compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the proposed framework enables interactive human body video coding without any additional pre-/post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.
IVOct 12, 2025
Towards Efficient 3D Gaussian Human Avatar Compression: A Prior-Guided FrameworkShanzhi Yin, Bolin Chen, Xinju Wu et al.
This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. The framework begins by training a canonical Gaussian avatar using articulated splatting in a network-free manner, which serves as the foundation for avatar appearance modeling. Simultaneously, a human-prior template is employed to capture temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy, enabling efficient compression: the canonical avatar is shared across the sequence, requiring compression only once, while the temporal parameters, consisting of just 94 parameters per frame, are transmitted with minimal bit-rate. For each frame, the target human avatar is generated by deforming canonical avatar via Linear Blend Skinning transformation, facilitating temporal coherent video reconstruction and novel view synthesis. Experimental results demonstrate that the proposed method significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression method in terms of rate-distortion performance on mainstream multi-view human video datasets, paving the way for seamless immersive multimedia experiences in meta-verse applications.
CVSep 27, 2025
Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex PredictionBolin Chen, Ru-Ling Liao, Yan Ye et al.
For bandwidth-constrained multimedia applications, simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction remains a critical challenge, as it demands the harmonization of dynamic motion modeling, detailed appearance synthesis, and geometric consistency. To address this challenge, we propose Sparse2Dense, a keypoint-driven generative framework that leverages extremely sparse 3D keypoints as compact transmitted symbols to enable ultra-low bitrate human video compression and precise human vertex prediction. The key innovation is the multi-task learning-based and keypoint-aware deep generative model, which could encode complex human motion via compact 3D keypoints and leverage these sparse keypoints to estimate dense motion for video synthesis with temporal coherence and realistic textures. Additionally, a vertex predictor is integrated to learn human vertex geometry through joint optimization with video generation, ensuring alignment between visual content and geometric structure. Extensive experiments demonstrate that the proposed Sparse2Dense framework achieves competitive compression performance for human video over traditional/generative video codecs, whilst enabling precise human vertex prediction for downstream geometry applications. As such, Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission, such as real-time motion analysis, virtual human animation, and immersive entertainment.
CVAug 19, 2025
A Lightweight Dual-Mode Optimization for Generative Face Video CodingZihan Zhang, Shanzhi Yin, Bolin Chen et al.
Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization -- combining architectural redesign and operational refinement -- to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
IVFeb 9, 2022
Exploring Structural Sparsity in Neural Image CompressionShanzhi Yin, Chao Li, Wen Tan et al.
Neural image compression have reached or out-performed traditional methods (such as JPEG, BPG, WebP). However,their sophisticated network structures with cascaded convolution layers bring heavy computational burden for practical deployment. In this paper, we explore the structural sparsity in neural image compression network to obtain real-time acceleration without any specialized hardware design or algorithm. We propose a simple plug-in adaptive binary channel masking(ABCM) to judge the importance of each convolution channel and introduce sparsity during training. During inference, the unimportant channels are pruned to obtain slimmer network and less computation. We implement our method into three neural image compression networks with different entropy models to verify its effectiveness and generalization, the experiment results show that up to 7x computation reduction and 3x acceleration can be achieved with negligible performance drop.
IVNov 18, 2021
Universal Efficient Variable-rate Neural Image CompressionShanzhi Yin, Chao Li, Youneng Bao et al.
Recently, Learning-based image compression has reached comparable performance with traditional image codecs(such as JPEG, BPG, WebP). However, computational complexity and rate flexibility are still two major challenges for its practical deployment. To tackle these problems, this paper proposes two universal modules named Energy-based Channel Gating(ECG) and Bit-rate Modulator(BM), which can be directly embedded into existing end-to-end image compression models. ECG uses dynamic pruning to reduce FLOPs for more than 50\% in convolution layers, and a BM pair can modulate the latent representation to control the bit-rate in a channel-wise manner. By implementing these two modules, existing learning-based image codecs can obtain ability to output arbitrary bit-rate with a single model and reduced computation.