Chuanmin Jia

CV
h-index32
25papers
815citations
Novelty44%
AI Score55

25 Papers

CVNov 13, 2022Code
Perceptual Video Coding for Machines via Satisfied Machine Ratio Modeling

Qi Zhang, Shanshe Wang, Xinfeng Zhang et al.

Video Coding for Machines (VCM) aims to compress visual signals for machine analysis. However, existing methods only consider a few machines, neglecting the majority. Moreover, the machine's perceptual characteristics are not leveraged effectively, resulting in suboptimal compression efficiency. To overcome these limitations, this paper introduces Satisfied Machine Ratio (SMR), a metric that statistically evaluates the perceptual quality of compressed images and videos for machines by aggregating satisfaction scores from them. Each score is derived from machine perceptual differences between original and compressed images. Targeting image classification and object detection tasks, we build two representative machine libraries for SMR annotation and create a large-scale SMR dataset to facilitate SMR studies. We then propose an SMR prediction model based on the correlation between deep feature differences and SMR. Furthermore, we introduce an auxiliary task to increase the prediction accuracy by predicting the SMR difference between two images in different quality. Extensive experiments demonstrate that SMR models significantly improve compression performance for machines and exhibit robust generalizability on unseen machines, codecs, datasets, and frame types. SMR enables perceptual coding for machines and propels VCM from specificity to generality. Code is available at https://github.com/ywwynm/SMR.

CVJun 1
Edge-directed geometric partitioning for versatile video coding

Xuewei Meng, Xinfeng Zhang, Chuanmin Jia et al.

To improve the coding performance, geometric partition (GEO) was proposed for the upcoming VVC standard. GEO provides 140 partition candidates. The index of optimal GEO mode needs to be signaled explicitly. Considering different structural characteristics of different CUs and the correlation between spatial adjacent blocks and temporal collocated blocks, we propose a GEO mode prediction strategy by constructing a Most Probable Mode (MPM) list to reduce the overhead of GEO index and improve coding efficiency. Based on the observation of the high correlation between the partition mode and object boundaries, an edge-directed geometric partition scheme is proposed to construct the MPM list according to spatio-temporal edge information. The proposed method provides an objective BD-rate gain of 0.58% and 1.00% on average for RA and LDB configurations compared to VTM-6.0. Besides, it also promotes the visual quality of object boundaries.

CVJun 1
Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang et al.

Geometric partitioning has attracted increasing attention by its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) causes a non-negligible burden for signaling the side information. Consequently, the coding efficiency is limited. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method can economize the bits consumed for side information signaling, including the partitioning mode and motion information. We firstly analyze the characteristics of partitioning mode decision and motion vector selection in a statistically-sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above mentioned side information. The main idea is to predict the STGEO modes and motion candidates that have higher selection possibilities, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.

CVJun 1
Deformable Wiener Filter for Future Video Coding

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang et al.

In-loop filters have attracted increasing attention due to the remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly take advantage of the image local similarity. Although some non-local based in-loop filters can make up for this shortcoming, the widely-used unsupervised parameter estimation method by non-local filters limits the performance. In view of this, we propose a deformable Wiener Filter (DWF). It combines the local and non-local characteristics and supervisedly trains the filter coefficients based on the Wiener Filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. Then the to-be-filtered samples are classified into specific groups based on the patch level noise and sample-level characteristics. Samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, the filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed with different reference sample derivation schemes in detail. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to the VTM-11.0 for All Intra, Random Access, and Low-Delay B configurations, respectively.

IVSep 6, 2022
Cross Modal Compression: Towards Human-comprehensible Semantic Compression

Jiguo Li, Chuanmin Jia, Xinfeng Zhang et al.

Traditional image/video compression aims to reduce the transmission/storage cost with signal fidelity as high as possible. However, with the increasing demand for machine analysis and semantic monitoring in recent years, semantic fidelity rather than signal fidelity is becoming another emerging concern in image/video compression. With the recent advances in cross modal translation and generation, in this paper, we propose the cross modal compression~(CMC), a semantic compression framework for visual data, to transform the high redundant visual data~(such as image, video, etc.) into a compact, human-comprehensible domain~(such as text, sketch, semantic map, attributions, etc.), while preserving the semantic. Specifically, we first formulate the CMC problem as a rate-distortion optimization problem. Secondly, we investigate the relationship with the traditional image/video compression and the recent feature compression frameworks, showing the difference between our CMC and these prior frameworks. Then we propose a novel paradigm for CMC to demonstrate its effectiveness. The qualitative and quantitative results show that our proposed CMC can achieve encouraging reconstructed results with an ultrahigh compression ratio, showing better compression performance than the widely used JPEG baseline.

CVJun 25, 2023
SpikeCodec: An End-to-end Learned Compression Framework for Spiking Camera

Kexiang Feng, Chuanmin Jia, Siwei Ma et al.

Recently, the bio-inspired spike camera with continuous motion recording capability has attracted tremendous attention due to its ultra high temporal resolution imaging characteristic. Such imaging feature results in huge data storage and transmission burden compared to that of traditional camera, raising severe challenge and imminent necessity in compression for spike camera captured content. Existing lossy data compression methods could not be applied for compressing spike streams efficiently due to integrate-and-fire characteristic and binarized data structure. Considering the imaging principle and information fidelity of spike cameras, we introduce an effective and robust representation of spike streams. Based on this representation, we propose a novel learned spike compression framework using scene recovery, variational auto-encoder plus spike simulator. To our knowledge, it is the first data-trained model for efficient and robust spike stream compression. Extensive experimental results show that our method outperforms the conventional and learning-based codecs, contributing a strong baseline for learned spike data compression.

IVApr 14, 2023
Machine Perception-Driven Image Compression: A Layered Generative Approach

Yuefeng Zhang, Chuanmin Jia, Jiannhui Chang et al.

In this age of information, images are a critical medium for storing and transmitting information. With the rapid growth of image data amount, visual compression and visual data perception are two important research topics attracting a lot attention. However, those two topics are rarely discussed together and follow separate research path. Due to the compact compressed domain representation offered by learning-based image compression methods, there exists possibility to have one stream targeting both efficient data storage and compression, and machine perception tasks. In this paper, we propose a layered generative image compression model achieving high human vision-oriented image reconstructed quality, even at extreme compression ratios. To obtain analysis efficiency and flexibility, a task-agnostic learning-based compression model is proposed, which effectively supports various compressed domain-based analytical tasks while reserves outstanding reconstructed perceptual quality, compared with traditional and learning-based codecs. In addition, joint optimization schedule is adopted to acquire best balance point among compression ratio, reconstructed image quality, and downstream perception performance. Experimental results verify that our proposed compressed domain-based multi-task analysis method can achieve comparable analysis results against the RGB image-based methods with up to 99.6% bit rate saving (i.e., compared with taking original RGB image as the analysis model input). The practical ability of our model is further justified from model size and information fidelity aspects.

CVJan 15, 2023
Learning to Compress Unmanned Aerial Vehicle (UAV) Captured Video: Benchmark and Analysis

Chuanmin Jia, Feng Ye, Huifang Sun et al.

During the past decade, the Unmanned-Aerial-Vehicles (UAVs) have attracted increasing attention due to their flexible, extensive, and dynamic space-sensing capabilities. The volume of video captured by UAVs is exponentially growing along with the increased bitrate generated by the advancement of the sensors mounted on UAVs, bringing new challenges for on-device UAV storage and air-ground data transmission. Most existing video compression schemes were designed for natural scenes without consideration of specific texture and view characteristics of UAV videos. In this work, we first contribute a detailed analysis of the current state of the field of UAV video coding. Then we propose to establish a novel task for learned UAV video coding and construct a comprehensive and systematic benchmark for such a task, present a thorough review of high quality UAV video datasets and benchmarks, and contribute extensive rate-distortion efficiency comparison of learned and conventional codecs after. Finally, we discuss the challenges of encoding UAV videos. It is expected that the benchmark will accelerate the research and development in video coding on drone platforms.

CVSep 15, 2025Code
MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation

Liying Wang, Xiaoli Zhang, Chuanmin Jia et al.

Infrared-visible image fusion methods aim at generating fused images with good visual quality and also facilitate the performance of high-level tasks. Indeed, existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic task-level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation. MAFS is a parallel structure, containing a fusion sub-network and a segmentation sub-network. On the one hand, We devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights of two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/.

MMNov 24, 2024
Advanced Learning-Based Inter Prediction for Future Video Coding

Yanchen Zhao, Wenhong Duan, Chuanmin Jia et al.

In the fourth generation Audio Video coding Standard (AVS4), the Inter Prediction Filter (INTERPF) reduces discontinuities between prediction and adjacent reconstructed pixels in inter prediction. The paper proposes a low complexity learning-based inter prediction (LLIP) method to replace the traditional INTERPF. LLIP enhances the filtering process by leveraging a lightweight neural network model, where parameters can be exported for efficient inference. Specifically, we extract pixels and coordinates utilized by the traditional INTERPF to form the training dataset. Subsequently, we export the weights and biases of the trained neural network model and implement the inference process without any third-party dependency, enabling seamless integration into video codec without relying on Libtorch, thus achieving faster inference speed. Ultimately, we replace the traditional handcraft filtering parameters in INTERPF with the learned optimal filtering parameters. This practical solution makes the combination of deep learning encoding tools with traditional video encoding schemes more efficient. Experimental results show that our approach achieves 0.01%, 0.31%, and 0.25% coding gain for the Y, U, and V components under the random access (RA) configuration on average.

CVFeb 13, 2024
A Neural-network Enhanced Video Coding Framework beyond ECM

Yanchen Zhao, Wenxuan He, Chuanmin Jia et al.

In this paper, a hybrid video compression framework is proposed that serves as a demonstrative showcase of deep learning-based approaches extending beyond the confines of traditional coding methodologies. The proposed hybrid framework is founded upon the Enhanced Compression Model (ECM), which is a further enhancement of the Versatile Video Coding (VVC) standard. We have augmented the latest ECM reference software with well-designed coding techniques, including block partitioning, deep learning-based loop filter, and the activation of block importance mapping (BIM) which was integrated but previously inactive within ECM, further enhancing coding performance. Compared with ECM-10.0, our method achieves 6.26, 13.33, and 12.33 BD-rate savings for the Y, U, and V components under random access (RA) configuration, respectively.

CVJun 9, 2025
ProSplat: Improved Feed-Forward 3D Gaussian Splatting for Wide-Baseline Sparse Views

Xiaohan Lu, Jiaye Fu, Jiaqi Zhang et al.

Feed-forward 3D Gaussian Splatting (3DGS) has recently demonstrated promising results for novel view synthesis (NVS) from sparse input views, particularly under narrow-baseline conditions. However, its performance significantly degrades in wide-baseline scenarios due to limited texture details and geometric inconsistencies across views. To address these challenges, in this paper, we propose ProSplat, a two-stage feed-forward framework designed for high-fidelity rendering under wide-baseline conditions. The first stage involves generating 3D Gaussian primitives via a 3DGS generator. In the second stage, rendered views from these primitives are enhanced through an improvement model. Specifically, this improvement model is based on a one-step diffusion model, further optimized by our proposed Maximum Overlap Reference view Injection (MORI) and Distance-Weighted Epipolar Attention (DWEA). MORI supplements missing texture and color by strategically selecting a reference view with maximum viewpoint overlap, while DWEA enforces geometric consistency using epipolar constraints. Additionally, we introduce a divide-and-conquer training strategy that aligns data distributions between the two stages through joint optimization. We evaluate ProSplat on the RealEstate10K and DL3DV-10K datasets under wide-baseline settings. Experimental results demonstrate that ProSplat achieves an average improvement of 1 dB in PSNR compared to recent SOTA methods.

CVOct 18, 2025
HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars

Haocheng Tang, Ruoke Yan, Xinhui Yin et al.

Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.

CVJun 11, 2025
TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation

Zetian Song, Jiaye Fu, Jiaqi Zhang et al.

The recent development of feedforward 3D Gaussian Splatting (3DGS) presents a new paradigm to reconstruct 3D scenes. Using neural networks trained on large-scale multi-view datasets, it can directly infer 3DGS representations from sparse input views. Although the feedforward approach achieves high reconstruction speed, it still suffers from the substantial storage cost of 3D Gaussians. Existing 3DGS compression methods relying on scene-wise optimization are not applicable due to architectural incompatibilities. To overcome this limitation, we propose TinySplat, a complete feedforward approach for generating compact 3D scene representations. Built upon standard feedforward 3DGS methods, TinySplat integrates a training-free compression framework that systematically eliminates key sources of redundancy. Specifically, we introduce View-Projection Transformation (VPT) to reduce geometric redundancy by projecting geometric parameters into a more compact space. We further present Visibility-Aware Basis Reduction (VABR), which mitigates perceptual redundancy by aligning feature energy along dominant viewing directions via basis transformation. Lastly, spatial redundancy is addressed through an off-the-shelf video codec. Comprehensive experimental results on multiple benchmark datasets demonstrate that TinySplat achieves over 100x compression for 3D Gaussian data generated by feedforward methods. Compared to the state-of-the-art compression approach, we achieve comparable quality with only 6% of the storage size. Meanwhile, our compression framework requires only 25% of the encoding time and 1% of the decoding time.

CVMar 4, 2025
A Joint Visual Compression and Perception Framework for Neuralmorphic Spiking Camera

Kexiang Feng, Chuanmin Jia, Siwei Ma et al.

The advent of neuralmorphic spike cameras has garnered significant attention for their ability to capture continuous motion with unparalleled temporal resolution.However, this imaging attribute necessitates considerable resources for binary spike data storage and transmission.In light of compression and spike-driven intelligent applications, we present the notion of Spike Coding for Intelligence (SCI), wherein spike sequences are compressed and optimized for both bit-rate and task performance.Drawing inspiration from the mammalian vision system, we propose a dual-pathway architecture for separate processing of spatial semantics and motion information, which is then merged to produce features for compression.A refinement scheme is also introduced to ensure consistency between decoded features and motion vectors.We further propose a temporal regression approach that integrates various motion dynamics, capitalizing on the advancements in warping and deformation simultaneously.Comprehensive experiments demonstrate our scheme achieves state-of-the-art (SOTA) performance for spike compression and analysis.We achieve an average 17.25% BD-rate reduction compared to SOTA codecs and a 4.3% accuracy improvement over SpiReco for spike-based classification, with 88.26% complexity reduction and 42.41% inference time saving on the encoding side.

CVDec 5, 2024
LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model

Yuan Xue, Qi Zhang, Chuanmin Jia et al.

Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than human viewing. Current works predominantly concentrate on high-level tasks like object detection and semantic segmentation. However, the quality of original images is usually not guaranteed in the real world, leading to even worse perceptual quality or downstream task performance after compression. Low-level (LL) machine vision models, like image restoration models, can help improve such quality, and thereby their compression requirements should also be considered. In this paper, we propose a pioneered ICM framework for LL machine vision tasks, namely LL-ICM. By jointly optimizing compression and LL tasks, the proposed LL-ICM not only enriches its encoding ability in generalizing to versatile LL tasks but also optimizes the processing ability of down-stream LL task models, achieving mutual adaptation for image codecs and LL task models. Furthermore, we integrate large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings for LL vision tasks. Therefore, one LL-ICM codec can generalize to multiple tasks. We establish a solid benchmark to evaluate LL-ICM, which includes extensive objective experiments by using both full and no-reference image quality assessments. Experimental results show that LL-ICM can achieve 22.65% BD-rate reductions over the state-of-the-art methods.

IVJun 24, 2021
Rate Distortion Characteristic Modeling for Neural Image Compression

Chuanmin Jia, Ziqing Ge, Shanshe Wang et al.

End-to-end optimized neural image compression (NIC) has obtained superior lossy compression performance recently. In this paper, we consider the problem of rate-distortion (R-D) characteristic analysis and modeling for NIC. We make efforts to formulate the essential mathematical functions to describe the R-D behavior of NIC using deep networks. Thus arbitrary bit-rate points could be elegantly realized by leveraging such model via a single trained network. We propose a plugin-in module to learn the relationship between the target bit-rate and the binary representation for the latent variable of auto-encoder. The proposed scheme resolves the problem of training distinct models to reach different points in the R-D space. Furthermore, we model the rate and distortion characteristic of NIC as a function of the coding parameter $λ$ respectively. Our experiments show our proposed method is easy to adopt and realizes state-of-the-art continuous bit-rate coding performance, which implies that our approach would benefit the practical deployment of NIC.

IVApr 21, 2021
Visual Analysis Motivated Rate-Distortion Model for Image Coding

Zhimeng Huang, Chuanmin Jia, Shanshe Wang et al.

Optimized for pixel fidelity metrics, images compressed by existing image codec are facing systematic challenges when used for visual analysis tasks, especially under low-bitrate coding. This paper proposes a visual analysis-motivated rate-distortion model for Versatile Video Coding (VVC) intra compression. The proposed model has two major contributions, a novel rate allocation strategy and a new distortion measurement model. We first propose the region of interest for machine (ROIM) to evaluate the degree of importance for each coding tree unit (CTU) in visual analysis. Then, a novel CTU-level bit allocation model is proposed based on ROIM and the local texture characteristics of each CTU. After an in-depth analysis of multiple distortion models, a visual analysis friendly distortion criteria is subsequently proposed by extracting deep feature of each coding unit (CU). To alleviate the problem of lacking spatial context information when calculating the distortion of each CU, we finally propose a multi-scale feature distortion (MSFD) metric using different neighboring pixels by weighting the extracted deep features in each scale. Extensive experimental results show that the proposed scheme could achieve up to 28.17\% bitrate saving under the same analysis performance among several typical visual analysis tasks such as image classification, object detection, and semantic segmentation.

CVMar 12, 2021
Thousand to One: Semantic Prior Modeling for Conceptual Coding

Jianhui Chang, Zhenghui Zhao, Lingbo Yang et al.

Conceptual coding has been an emerging research topic recently, which encodes natural images into disentangled conceptual representations for compression. However, the compression performance of the existing methods is still sub-optimal due to the lack of comprehensive consideration of rate constraint and reconstruction quality. To this end, we propose a novel end-to-end semantic prior modeling-based conceptual coding scheme towards extremely low bitrate image compression, which leverages semantic-wise deep representations as a unified prior for entropy estimation and texture synthesis. Specifically, we employ semantic segmentation maps as structural guidance for extracting deep semantic prior, which provides fine-grained texture distribution modeling for better detail construction and higher flexibility in subsequent high-level vision tasks. Moreover, a cross-channel entropy model is proposed to further exploit the inter-channel correlation of the spatially independent semantic prior, leading to more accurate entropy estimation for rate-constrained training. The proposed scheme achieves an ultra-high 1000x compression ratio, while still enjoying high visual reconstruction quality and versatility towards visual processing and analysis tasks.

CVNov 10, 2020
Conceptual Compression via Deep Structure and Texture Synthesis

Jianhui Chang, Zhenghui Zhao, Chuanmin Jia et al.

Existing compression methods typically focus on the removal of signal-level redundancies, while the potential and versatility of decomposing visual data into compact conceptual components still lack further study. To this end, we propose a novel conceptual compression framework that encodes visual data into compact structure and texture representations, then decodes in a deep synthesis fashion, aiming to achieve better visual reconstruction quality, flexible content manipulation, and potential support for various vision tasks. In particular, we propose to compress images by a dual-layered model consisting of two complementary visual features: 1) structure layer represented by structural maps and 2) texture layer characterized by low-dimensional deep representations. At the encoder side, the structural maps and texture representations are individually extracted and compressed, generating the compact, interpretable, inter-operable bitstreams. During the decoding stage, a hierarchical fusion GAN (HF-GAN) is proposed to learn the synthesis paradigm where the textures are rendered into the decoded structural maps, leading to high-quality reconstruction with remarkable visual realism. Extensive experiments on diverse images have demonstrated the superiority of our framework with lower bitrates, higher reconstruction quality, and increased versatility towards visual analysis and content manipulation tasks.

ASApr 7, 2020
Universal Adversarial Perturbations Generative Network for Speaker Recognition

Jiguo Li, Xinfeng Zhang, Chuanmin Jia et al.

Attacking deep learning based biometric systems has drawn more and more attention with the wide deployment of fingerprint/face/speaker recognition systems, given the fact that the neural networks are vulnerable to the adversarial examples, which have been intentionally perturbed to remain almost imperceptible for human. In this paper, we demonstrated the existence of the universal adversarial perturbations~(UAPs) for the speaker recognition systems. We proposed a generative network to learn the mapping from the low-dimensional normal distribution to the UAPs subspace, then synthesize the UAPs to perturbe any input signals to spoof the well-trained speaker recognition model with high probability. Experimental results on TIMIT and LibriSpeech datasets demonstrate the effectiveness of our model.

MMApr 7, 2020
Direct Speech-to-image Translation

Jiguo Li, Xinfeng Zhang, Chuanmin Jia et al.

Direct speech-to-image translation without text is an interesting and useful topic due to the potential applications in human-computer interaction, art creation, computer-aided design. etc. Not to mention that many languages have no writing form. However, as far as we know, it has not been well-studied how to translate the speech signals into images directly and how well they can be translated. In this paper, we attempt to translate the speech signals into the image signals without the transcription stage. Specifically, a speech encoder is designed to represent the input speech signals as an embedding feature, and it is trained with a pretrained image encoder using teacher-student learning to obtain better generalization ability on new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature. Experimental results on both synthesized and real data show that our proposed method is effective to translate the raw speech signals into images without the middle text representation. Ablation study gives more insights about our method.

CVApr 7, 2019
Image and Video Compression with Neural Networks: A Review

Siwei Ma, Xinfeng Zhang, Chuanmin Jia et al.

In recent years, the image and video coding technologies have advanced by leaps and bounds. However, due to the popularization of image and video acquisition devices, the growth rate of image and video data is far beyond the improvement of the compression ratio. In particular, it has been widely recognized that there are increasing challenges of pursuing further coding performance improvement within the traditional hybrid coding framework. Deep convolution neural network (CNN) which makes the neural network resurge in recent years and has achieved great success in both artificial intelligent and signal processing fields, also provides a novel and promising solution for image and video compression. In this paper, we provide a systematic, comprehensive and up-to-date review of neural network based image and video compression techniques. The evolution and development of neural network based compression methodologies are introduced for images and video respectively. More specifically, the cutting-edge video coding techniques by leveraging deep learning and HEVC framework are presented and discussed, which promote the state-of-the-art video coding performance substantially. Moreover, the end-to-end image and video coding frameworks based on neural networks are also reviewed, revealing interesting explorations on next generation image and video coding frameworks/standards. The most significant research works on the image and video coding related topics using neural networks are highlighted, and future trends are also envisioned. In particular, the joint compression on semantic and visual information is tentatively explored to formulate high efficiency signal representation structure for both human vision and machine vision, which are the two dominant signal receptor in the age of artificial intelligence.

MMApr 1, 2019
Layered Image Compression using Scalable Auto-encoder

Chuanmin Jia, Zhaoyi Liu, Yao Wang et al.

This paper presents a novel convolutional neural network (CNN) based image compression framework via scalable auto-encoder (SAE). Specifically, our SAE based deep image codec consists of hierarchical coding layers, each of which is an end-to-end optimized auto-encoder. The coarse image content and texture are encoded through the first (base) layer while the consecutive (enhance) layers iteratively code the pixel-level reconstruction errors between the original and former reconstructed images. The proposed SAE structure alleviates the need to train multiple models for different bit-rate points by recently proposed auto-encoder based codecs. The SAE layers can be combined to realize multiple rate points, or to produce a scalable stream. The proposed method has similar rate-distortion performance in the low-to-medium rate range as the state-of-the-art CNN based image codec (which uses different optimized networks to realize different bit rates) over a standard public image dataset. Furthermore, the proposed codec generates better perceptual quality in this bit rate range.

MMSep 25, 2017
Spatial-Temporal Residue Network Based In-Loop Filter for Video Coding

Chuanmin Jia, Shiqi Wang, Xinfeng Zhang et al.

Deep learning has demonstrated tremendous break through in the area of image/video processing. In this paper, a spatial-temporal residue network (STResNet) based in-loop filter is proposed to suppress visual artifacts such as blocking, ringing in video coding. Specifically, the spatial and temporal information is jointly exploited by taking both current block and co-located block in reference frame into consideration during the processing of in-loop filter. The architecture of STResNet only consists of four convolution layers which shows hospitality to memory and coding complexity. Moreover, to fully adapt the input content and improve the performance of the proposed in-loop filter, coding tree unit (CTU) level control flag is applied in the sense of rate-distortion optimization. Extensive experimental results show that our scheme provides up to 5.1% bit-rate reduction compared to the state-of-the-art video coding standard.