CV Aug 11, 2023
BATINet: Background-Aware Text to Image Synthesis and Manipulation Network
Ryugo Morita, Zhiqiang Zhang, Jinjia Zhou
Background-Induced Text2Image (BIT2I) aims to generate foreground content according to the text on the given background image. Most studies focus on generating high-quality foreground content, although they ignore the relationship between the two contents. In this study, we analyzed a novel Background-Aware Text2Image (BAT2I) task in which the generated content matches the input background. We proposed a Background-Aware Text to Image synthesis and manipulation Network (BATINet), which contains two key components: Position Detect Network (PDN) and Harmonize Network (HN). The PDN detects the most plausible position of the text-relevant object in the background image. The HN harmonizes the generated content referring to background style information. Finally, we reconstructed the generation network, which consists of the multi-GAN and attention module to match more user preferences. Moreover, we can apply BATINet to text-guided image manipulation. It solves the most challenging task of manipulating the shape of an object. We demonstrated through qualitative and quantitative evaluations on the CUB dataset that the proposed model outperforms other state-of-the-art methods.
CV Nov 25, 2022
Interactive Image Manipulation with Complex Text Instructions
Ryugo Morita, Zhiqiang Zhang, Man M. Ho et al.
Recently, text-guided image manipulation has received increasing attention in multimedia processing and computer vision research due to its high flexibility and controllability. Its goal is to semantically manipulate parts of an input reference image according to text descriptions. However, most existing works have the following problems: (1) text-irrelevant content is not always maintained and may change randomly, (2) the performance of image manipulation still needs to be improved, and (3) only descriptive attributes can be manipulated. To solve these problems, we propose a novel image manipulation method that interactively edits an image using complex text instructions. It allows users not only to improve the accuracy of image manipulation but also to achieve complex tasks such as enlarging, shrinking, or removing objects and replacing the background with the input image. To make these tasks possible, we apply three strategies. First, the given image is divided into text-relevant content and text-irrelevant content. Only the text-relevant content is manipulated, so the text-irrelevant content can be maintained. Second, a super-resolution method is used to enlarge the manipulation region to further improve the operability and to help manipulate the object itself. Third, a user interface is introduced for editing the segmentation map interactively to re-modify the generated image according to the user's desires. Extensive experiments on the Caltech-UCSD Birds-200-2011 (CUB) and Microsoft Common Objects in Context (MS COCO) datasets demonstrate that our proposed method enables interactive, flexible, and accurate image manipulation in real time. Through qualitative and quantitative evaluations, we show that the proposed model outperforms other state-of-the-art methods.
CV Aug 14, 2022
Global Priors Guided Modulation Network for Joint Super-Resolution and Inverse Tone-Mapping
Gang He, Shaoyi Long, Li Xu et al.
Joint super-resolution and inverse tone-mapping (SR-ITM) aims to enhance the visual quality of videos that have quality deficiencies in resolution and dynamic range. This problem arises when using 4K high dynamic range (HDR) TVs to watch a low-resolution standard dynamic range (LR SDR) video. Previous methods that rely on learning local information typically cannot do well in preserving color conformity and long-range structural similarity, resulting in unnatural color transition and texture artifacts. In order to tackle these challenges, we propose a global priors guided modulation network (GPGMNet) for joint SR-ITM. In particular, we design a global priors extraction module (GPEM) to extract color conformity prior and structural similarity prior that are beneficial for ITM and SR tasks, respectively. To further exploit the global priors and preserve spatial information, we devise multiple global priors guided spatial-wise modulation blocks (GSMBs) with a few parameters for intermediate feature modulation, in which the modulation parameters are generated by the shared global priors and the spatial features map from the spatial pyramid convolution block (SPCB). With these elaborate designs, the GPGMNet can achieve higher visual quality with lower computational complexity. Extensive experiments demonstrate that our proposed GPGMNet is superior to the state-of-the-art methods. Specifically, our proposed model exceeds the state-of-the-art by 0.64 dB in PSNR, with 69% fewer parameters and a 3.1× speedup. The code will be released soon.
CV May 16
Coarse Semantic Injection for LLM-Conditioned Structured Indoor Prediction
Shuliang Zhu, Tomiwa Adey, Jinjia Zhou
Large language models (LLMs) have recently been used as structured decoders for indoor understanding from 3D point-token inputs. However, point cloud encoders often under-represent thin structural elements such as doors and windows after voxelization and sparse pooling, and may miss individual furniture instances in cluttered scenes. We propose an interface-preserving semantic augmentation for LLM-conditioned structured decoding. The key idea is to associate semantic evidence with the point-cloud representation, reduce it to a coarse four-group code (furniture, walls, openings, and others), and encode it as an RGBB point interface: red for furniture, green for walls, blue for openings, and black for others, where RGBB denotes four semantic color states represented in three RGB channels rather than an additional fourth channel. This semantic color code is appended to the original raw point attributes before tokenization, so geometry and semantics share the same sparse tokenization path while the downstream language model decoder and output serialization remain unchanged. We further introduce a lightweight routed semantic shift module, with an auxiliary head used only for training-time ratio/budget regularization and analysis, to strengthen semantic cues after sparse pooling. The overall pipeline can use RGB-derived semantic evidence. Under these controlled semantic-source settings, the reported metrics improve across Structured3D, the SpatialLM dataset, and ARKitScenes, especially for opening localization and per-instance furniture detection in cluttered scenes. Ablations clarify the roles of semantic source, color coding, token fusion, and shift injection, while also showing that color/entropy effects remain nontrivial.
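The RGBB interface described in this abstract can be pictured with a minimal sketch: four semantic groups are mapped to red, green, blue, and black, and the resulting three color channels are appended to each point's raw attributes. The integer group ids, the `append_rgbb` helper, and the use of plain xyz coordinates as the "raw point attributes" are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

# Hypothetical group ids; the abstract names the four groups but not their encoding.
FURNITURE, WALLS, OPENINGS, OTHERS = 0, 1, 2, 3

# Four semantic states expressed in three RGB channels (no fourth channel).
RGBB_PALETTE = np.array(
    [
        [1.0, 0.0, 0.0],  # furniture -> red
        [0.0, 1.0, 0.0],  # walls     -> green
        [0.0, 0.0, 1.0],  # openings  -> blue
        [0.0, 0.0, 0.0],  # others    -> black
    ],
    dtype=np.float32,
)


def append_rgbb(points: np.ndarray, group_ids: np.ndarray) -> np.ndarray:
    """Append a 3-channel semantic color to each point's raw attributes."""
    if points.shape[0] != group_ids.shape[0]:
        raise ValueError("one group id is required per point")
    colors = RGBB_PALETTE[group_ids]  # (N, 3) palette lookup
    return np.concatenate([points.astype(np.float32), colors], axis=1)


xyz = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 0.5], [3.0, 0.1, 1.2]])
labels = np.array([FURNITURE, OPENINGS, OTHERS])
augmented = append_rgbb(xyz, labels)  # shape (3, 6): xyz followed by the color code
```

Because the color code occupies ordinary attribute columns, whatever tokenizer consumes the point attributes would see geometry and semantics together, which is the property the abstract emphasizes.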
CV Sep 12, 2024
Learn from Balance: Rectifying Knowledge Transfer for Long-Tailed Scenarios
Xinlei Huang, Jialiang Tang, Xubin Zheng et al.
Knowledge Distillation (KD) transfers knowledge from a large pre-trained teacher network to a compact and efficient student network, making it suitable for deployment on resource-limited media terminals. However, traditional KD methods require balanced data to ensure robust training, which is often unavailable in practical applications. In such scenarios, a few head categories occupy a substantial proportion of examples. This imbalance biases the trained teacher network towards the head categories, resulting in severe performance degradation on the less represented tail categories for both the teacher and student networks. In this paper, we propose a novel framework called Knowledge Rectification Distillation (KRDistill) to address the imbalanced knowledge inherited in the teacher network through the incorporation of the balanced category priors. Furthermore, we rectify the biased predictions produced by the teacher network, particularly focusing on the tail categories. Consequently, the teacher network can provide balanced and accurate knowledge to train a reliable student network. Intensive experiments conducted on various long-tailed datasets demonstrate that our KRDistill can effectively train reliable student networks in realistic scenarios of data imbalance.
CV Dec 11, 2024 Code
Unicorn: Unified Neural Image Compression with One Number Reconstruction
Qi Zheng, Haozhi Wang, Zihao Liu et al.
Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former faces an impasse, with bitrate reduction leveling off at the cost of tremendous complexity, while the latter suffers from over-smoothed quality as well as lengthy decoder models. In this paper, we propose an innovative paradigm, which we dub Unicorn (Unified Neural Image Compression with One Number Reconstruction). By conceptualizing the images as index-image pairs and learning the inherent distribution of pairs in a subtle neural network model, Unicorn can reconstruct a visually pleasing image from randomly generated noise with only one index number. The neural model serves as the unified decoder of images while the noises and indexes correspond to explicit representations. As a proof of concept, we propose an effective and efficient prototype of Unicorn based on latent diffusion models with tailored model designs. Quantitative and qualitative experimental results demonstrate that our prototype achieves significant bitrate reduction compared with EIC and IIC algorithms. More impressively, benefiting from the unified decoder, our compression ratio escalates as the quantity of images increases. We envision that more advanced model designs will endow Unicorn with greater potential in image compression. We will release our code at https://github.com/uniqzheng/Unicorn-Laduree.
IV Sep 17, 2024
Edge-based Denoising Image Compression
Ryugo Morita, Hitoshi Nishimura, Ko Watanabe et al.
In recent years, deep learning-based image compression, particularly through generative models, has emerged as a pivotal area of research. Despite significant advancements, challenges such as diminished sharpness and quality in reconstructed images, learning inefficiencies due to mode collapse, and data loss during transmission persist. To address these issues, we propose a novel compression model that incorporates a denoising step with diffusion models, significantly enhancing image reconstruction fidelity by leveraging sub-information (e.g., edge and depth) from the latent space. Empirical experiments demonstrate that our model achieves superior or comparable results in terms of image quality and compression efficiency when measured against existing models. Notably, our model excels in scenarios of partial image loss or excessive noise by introducing an edge estimation network to preserve the integrity of reconstructed images, offering a robust solution to the current limitations of image compression.
CV Dec 15, 2021 Code
Transcoded Video Restoration by Temporal Spatial Auxiliary Network
Li Xu, Gang He, Jinjia Zhou et al.
On most video platforms, such as YouTube and TikTok, the played videos have usually undergone multiple video encodings, such as hardware encoding by recording devices, software encoding by video editing apps, and single or multiple video transcoding by video application servers. Previous works in compressed video restoration typically assume the compression artifacts are caused by one-time encoding. Thus, the derived solution usually does not work very well in practice. In this paper, we propose a new method, temporal spatial auxiliary network (TSAN), for transcoded video restoration. Our method considers the unique traits between video encoding and transcoding, and we consider the initial shallow encoded videos as the intermediate labels to assist the network in conducting self-supervised attention training. In addition, we employ adjacent multi-frame information and propose temporal deformable alignment and pyramidal spatial fusion for transcoded video restoration. The experimental results demonstrate that the performance of the proposed method is superior to that of the previous techniques. The code is available at https://github.com/icecherylXuli/TSAN.
CV Nov 12, 2025
GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow
Rui Wan, Qi Zheng, Ruoyu Zhang et al.
The Animation-based Generative Codec (AGC) is an emerging paradigm for talking-face video compression. However, deploying its intricate decoder on resource- and power-constrained edge devices presents challenges due to numerous parameters, the inflexibility to adapt to dynamically evolving algorithms, and the high power consumption induced by extensive computations and data transmission. This paper proposes, for the first time, a field-programmable gate array (FPGA)-oriented AGC deployment scheme for edge-computing video services. Initially, we analyze the AGC algorithm and employ network compression methods including post-training static quantization and layer fusion techniques. Subsequently, we design an overlapped accelerator utilizing the co-processor paradigm to perform computations through software-hardware co-design. The hardware processing unit comprises engines such as convolution, grid sampling, upsample, etc. Parallelization optimization strategies like double-buffered pipelines and loop unrolling are employed to fully exploit the resources of the FPGA. Ultimately, we establish an AGC FPGA prototype on the PYNQ-Z1 platform using the proposed scheme, achieving 24.9× and 4.1× higher energy efficiency against a commercial Central Processing Unit (CPU) and Graphic Processing Unit (GPU), respectively. Specifically, only 11.7 microjoules (µJ) are required for one pixel reconstructed by this FPGA system.
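Post-training static quantization, one of the compression steps named in this abstract, fixes a quantization scale offline from calibration data and reuses it at inference time. The sketch below shows a generic symmetric per-tensor int8 variant; the bit width, the symmetric scheme, and the helper names are assumptions for illustration, since the abstract does not specify them.

```python
import numpy as np


def calibrate_scale(calibration: np.ndarray, num_bits: int = 8) -> float:
    """Derive one fixed scale from calibration data (the 'static' part)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    return float(np.abs(calibration).max()) / qmax


def quantize(x: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    """Map floats to signed integers; values outside the calibrated range saturate."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale


calibration_batch = np.array([-0.4, 0.9, 1.27])
scale = calibrate_scale(calibration_batch)  # about 0.01 per integer step
q = quantize(np.array([0.5, -1.27, 2.0]), scale)  # 2.0 exceeds the range and saturates
restored = dequantize(q, scale)
```

Values inside the calibrated range round-trip with an error of at most half a step, while out-of-range values clip, which is why the choice of calibration data matters in practice.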
IV Mar 12, 2025
Bidirectional Learned Facial Animation Codec for Low Bitrate Talking Head Videos
Riku Takahashi, Ryugo Morita, Fuma Kimishima et al.
Existing deep facial animation coding techniques efficiently compress talking head videos by applying deep generative models. Instead of compressing the entire video sequence, these methods focus on compressing only the keyframe and the keypoints of non-keyframes (target frames). The target frames are then reconstructed by utilizing a single keyframe, and the keypoints of the target frame. Although these unidirectional methods can reduce the bitrate, they rely on a single keyframe and often struggle to capture large head movements accurately, resulting in distortions in the facial region. In this paper, we propose a novel bidirectional learned animation codec that generates natural facial videos using past and future keyframes. First, in the Bidirectional Reference-Guided Auxiliary Stream Enhancement (BRG-ASE) process, we introduce a compact auxiliary stream for non-keyframes, which is enhanced by adaptively selecting one of two keyframes (past and future). This stream improves video quality with a slight increase in bitrate. Then, in the Bidirectional Reference-Guided Video Reconstruction (BRG-VRec) process, we animate the adaptively selected keyframe and reconstruct the target frame using both the animated keyframe and the auxiliary frame. Extensive experiments demonstrate a 55% bitrate reduction compared to the latest animation based video codec, and a 35% bitrate reduction compared to the latest video coding standard, Versatile Video Coding (VVC) on a talking head video dataset. It showcases the efficiency of our approach in improving video quality while simultaneously decreasing bitrate.
CV Nov 23, 2024
TKG-DM: Training-free Chroma Key Content Generation Diffusion Model
Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser et al.
Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.
CV Nov 16, 2024
Multi-perspective Contrastive Logit Distillation
Qi Wang, Jinjia Zhou
In previous studies on knowledge distillation, the significance of logit distillation has frequently been overlooked. To revitalize logit distillation, we present a novel perspective by reconsidering its computation based on the semantic properties of logits and exploring how to utilize it more efficiently. Logits often contain a substantial amount of high-level semantic information; however, the conventional approach of employing logits to compute Kullback-Leibler (KL) divergence does not account for their semantic properties. Furthermore, this direct KL divergence computation fails to fully exploit the potential of logits. To address these challenges, we introduce a novel and efficient logit distillation method, Multi-perspective Contrastive Logit Distillation (MCLD), which substantially improves the performance and efficacy of logit distillation. In comparison to existing logit distillation methods and complex feature distillation methods, MCLD attains state-of-the-art performance in image classification, and transfer learning tasks across multiple datasets, including CIFAR-100, ImageNet, Tiny-ImageNet, and STL-10. Additionally, MCLD exhibits superior training efficiency and outstanding performance with distilling on Vision Transformers, further emphasizing its notable advantages. This study unveils the vast potential of logits in knowledge distillation and seeks to offer valuable insights for future research.
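For context, the conventional logit distillation that this abstract contrasts itself with is usually written as a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that baseline only, not MCLD; the temperature value and the customary T² scaling follow the standard knowledge distillation formulation and are not stated in the abstract.

```python
import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kd_kl_loss(student_logits: np.ndarray, teacher_logits: np.ndarray,
               temperature: float = 4.0) -> float:
    """Mean KL(teacher || student) over the batch, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(kl.mean() * temperature ** 2)


teacher = np.array([[2.0, 1.0, 0.1]])
student = np.array([[1.0, 1.5, 0.3]])
loss_same = kd_kl_loss(teacher, teacher)  # identical logits give zero divergence
loss_diff = kd_kl_loss(student, teacher)  # mismatched logits give a positive loss
```

The loss treats each class probability as an independent number to be matched, which is the limitation the abstract points to when it says the KL computation ignores the semantic properties of logits.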
CV Nov 15, 2024
Visual question answering based evaluation metrics for text-to-image generation
Mizuki Miyamoto, Ryugo Morita, Jinjia Zhou
Text-to-image generation and text-guided image manipulation have received considerable attention in the field of image generation. However, the mainstream evaluation methods for these tasks mainly focus on the overall alignment between the input text and the generated images, and they have difficulty evaluating whether all the information from the input text is accurately reflected in the generated images. This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object. First, ChatGPT is used to produce questions about the generated images from the input text. Then, we use Visual Question Answering (VQA) to measure the relevance of the generated images to the input text, which allows for a more detailed evaluation of the alignment than existing methods. In addition, we use Non-Reference Image Quality Assessment (NR-IQA) to evaluate not only the text-image alignment but also the quality of the generated images. Experimental results show that our proposed evaluation approach is a superior metric that can simultaneously assess finer text-image alignment and image quality while allowing the ratio between the two to be adjusted.
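One way to read the adjustable ratio mentioned in this abstract is as a weighted blend of an alignment term and a quality term. The sketch below is purely illustrative: the `combined_score` function, the use of the fraction of correctly answered VQA questions as the alignment term, the normalized quality score, and the linear weighting are all assumptions, not the paper's definition.

```python
def combined_score(vqa_correct, quality, alignment_weight=0.5):
    """Blend per-question VQA correctness with an image-quality score in [0, 1]."""
    if not vqa_correct:
        raise ValueError("at least one VQA question is required")
    if not 0.0 <= alignment_weight <= 1.0:
        raise ValueError("alignment_weight must lie in [0, 1]")
    # Fraction of object-level questions whose VQA answer agrees with the prompt.
    alignment = sum(vqa_correct) / len(vqa_correct)
    return alignment_weight * alignment + (1.0 - alignment_weight) * quality


# Three of four questions agree with the prompt; quality is a hypothetical normalized NR-IQA score.
score = combined_score([True, True, False, True], quality=0.8, alignment_weight=0.5)
alignment_only = combined_score([True, True, False, True], quality=0.8, alignment_weight=1.0)
```

Scoring each question separately is what lets a metric of this kind report which object or attribute in the prompt was missed, rather than a single global similarity.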
CV Sep 16, 2025
Adaptive Sampling Scheduler
Qi Wang, Shuliang Zhu, Jinjia Zhou
Consistency distillation methods have evolved into effective techniques that significantly accelerate the sampling process of diffusion models. Although existing methods have achieved remarkable results, the selection of target timesteps during distillation mainly relies on deterministic or stochastic strategies, which often require sampling schedulers to be designed specifically for different distillation processes. Moreover, this pattern severely limits flexibility, thereby restricting the full sampling potential of diffusion models in practical applications. To overcome these limitations, this paper proposes an adaptive sampling scheduler that is applicable to various consistency distillation frameworks. The scheduler introduces three innovative strategies: (i) dynamic target timestep selection, which adapts to different consistency distillation frameworks by selecting timesteps based on their computed importance; (ii) optimized alternating sampling along the solution trajectory, which guides forward denoising and backward noise addition based on the proposed timestep importance, enabling more effective exploration of the solution space to enhance generation performance; and (iii) smoothing clipping and color balancing techniques, which achieve stable and high-quality generation results at high guidance scales, thereby expanding the applicability of consistency distillation models in complex generation scenarios. We validated the effectiveness and flexibility of the adaptive sampling scheduler across various consistency distillation methods through comprehensive experimental evaluations. Experimental results consistently demonstrated significant improvements in generative performance, highlighting the strong adaptability achieved by our method.
CV Aug 6, 2025
TopKD: Top-scaled Knowledge Distillation
Qi Wang, Jinjia Zhou
Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher's logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods. Moreover, our method demonstrates substantial effectiveness when distilling Vision Transformers, underscoring its versatility across diverse network architectures. These findings highlight the significant potential of logits to advance knowledge distillation.
IV Jun 16, 2025
Audio-Visual Driven Compression for Low-Bitrate Talking Head Videos
Riku Takahashi, Ryugo Morita, Jinjia Zhou
Talking head video compression has advanced with neural rendering and keypoint-based methods, but challenges remain, especially at low bitrates, including handling large head movements, suboptimal lip synchronization, and distorted facial reconstructions. To address these problems, we propose a novel audio-visual driven video codec that integrates compact 3D motion features and audio signals. This approach robustly models significant head rotations and aligns lip movements with speech, improving both compression efficiency and reconstruction quality. Experiments on the CelebV-HQ dataset show that our method reduces bitrate by 22% compared to VVC and by 8.5% compared to a state-of-the-art learning-based codec. Furthermore, it provides superior lip-sync accuracy and visual fidelity at comparable bitrates, highlighting its effectiveness in bandwidth-constrained scenarios.
CV Nov 15, 2024
Block based Adaptive Compressive Sensing with Sampling Rate Control
Kosuke Iwama, Ryugo Morita, Jinjia Zhou
Compressive sensing (CS), acquiring and reconstructing signals below the Nyquist rate, has great potential in image and video acquisition to exploit data redundancy and greatly reduce the amount of sampled data. To further reduce the sampled data while keeping the video quality, this paper explores the temporal redundancy in video CS and proposes a block based adaptive compressive sensing framework with a sampling rate (SR) control strategy. To avoid redundant compression of non-moving regions, we first incorporate moving block detection between consecutive frames, and only transmit the measurements of moving blocks. The non-moving regions are reconstructed from the previous frame. In addition, we propose a block storage system and a dynamic threshold to achieve adaptive SR allocation to each frame based on the area of moving regions and target SR for controlling the average SR within the target SR. Finally, to reduce blocking artifacts and improve reconstruction quality, we adopt a cooperative reconstruction of the moving and non-moving blocks by referring to the measurements of the non-moving blocks from the previous frame. Extensive experiments have demonstrated that this work is able to control SR and obtain better performance than existing works.
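The moving block detection step in this abstract can be illustrated with a small sketch that flags blocks whose content changed between consecutive frames. The abstract does not say how motion is measured, so the mean absolute pixel difference, the fixed threshold, and the `moving_block_mask` helper are assumptions; a real CS system might compare measurements rather than pixels, and the paper uses a dynamic threshold.

```python
import numpy as np


def moving_block_mask(prev: np.ndarray, curr: np.ndarray,
                      block: int, threshold: float) -> np.ndarray:
    """Return a boolean grid marking blocks whose mean absolute change exceeds `threshold`."""
    h, w = curr.shape
    if h % block or w % block:
        raise ValueError("frame size must be a multiple of the block size")
    diff = np.abs(curr.astype(np.float64) - prev.astype(np.float64))
    # Reshape to (block rows, block, block cols, block) and average inside each block.
    per_block = diff.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return per_block > threshold


prev_frame = np.zeros((8, 8))
curr_frame = prev_frame.copy()
curr_frame[0:4, 4:8] = 50.0  # only the top-right block changes
mask = moving_block_mask(prev_frame, curr_frame, block=4, threshold=1.0)
moving_fraction = float(mask.mean())  # share of blocks whose measurements would be sent
```

Under this reading, only the blocks flagged in `mask` would be measured and transmitted, and the remaining blocks would be copied from the previous reconstruction.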
MM Oct 10, 2021
Real-time FPGA Design for OMP Targeting 8K Image Reconstruction
Jiayao Xu, Chen Fu, Zhiqiang Zhang et al.
During the past decade, implementing reconstruction algorithms on hardware has received much attention in the field of real-time reconstruction in Compressed Sensing (CS). Orthogonal Matching Pursuit (OMP) is the most widely used reconstruction algorithm in hardware implementations because it obtains good-quality reconstruction results at a reasonable time cost. OMP includes a Dot Product (DP) part and a Least Square Problem (LSP) part. These two parts involve numerous division calculations and considerable vector-based multiplications, which limit the implementation of real-time reconstruction on hardware. In the theory of CS, besides the reconstruction algorithm, the choice of sensing matrix affects the quality of reconstruction. It also influences the reconstruction efficiency by affecting the hardware architecture. Thus, designing a real-time hardware architecture for OMP needs to take three factors into consideration: the choice of sensing matrix, the implementation of DP, and the implementation of LSP. In this paper, a sensing matrix that is sparse and mainly contains zero vectors is adopted to optimize the OMP reconstruction and break the bottleneck of reconstruction efficiency. Based on the features of the chosen matrix, the DP and LSP are implemented by simple shift, add, and compare procedures. This work is implemented on the Xilinx Virtex UltraScale+ FPGA device. To reconstruct a digital signal of length 1024 at a 0.25 sampling rate, the proposed method takes 0.818 µs while the state-of-the-art takes 238 µs. Thus, this work speeds up the state-of-the-art method by 290 times. This work takes 0.026 s to reconstruct an 8K gray image, which achieves 30 FPS real-time reconstruction.
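The two OMP stages named in this abstract, the dot product and the least-squares solve, are easiest to see in a textbook floating-point version of the algorithm. The sketch below is that textbook version, not the paper's shift-and-add FPGA architecture; the Gaussian sensing matrix, the sparsity level of 5, and the random seed are assumptions, with only the signal length of 1024 and the 0.25 sampling rate taken from the abstract.

```python
import numpy as np


def omp(A: np.ndarray, y: np.ndarray, sparsity: int) -> np.ndarray:
    """Recover a `sparsity`-sparse x from y = A @ x by greedy column selection."""
    residual = y.astype(float).copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        # Dot-product stage: correlate every column with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares stage: refit coefficients on all chosen columns.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    x = np.zeros(A.shape[1])
    x[support] = coeffs
    return x


rng = np.random.default_rng(0)
n, m, k = 1024, 256, 5                    # length 1024 at a 0.25 sampling rate
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)            # unit-norm columns
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.choice([-1.0, 1.0], size=k)
y = A @ x_true
x_hat = omp(A, y, sparsity=k)
```

Each iteration performs one matrix-vector product and one least-squares solve, which are the operations the abstract identifies as the hardware bottleneck and replaces with shift, add, and compare logic.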
MM Jul 14, 2021
RCLC: ROI-based joint conventional and learning video compression
Trinh Man Hoang, Jinjia Zhou
COVID-19 has led to the highest demand for remote interactive systems ever seen. One of the key elements of these systems is video streaming, which requires very high network bandwidth due to its real-time demand, especially with high-resolution video. Existing video compression methods struggle with the trade-off between video quality and the speed requirement. Observing that the background information rarely changes in most remote meeting cases, we introduce a Region-Of-Interest (ROI) based video compression framework (named RCLC) that leverages cutting-edge learning-based and conventional technologies. In RCLC, each incoming frame is marked as a background-updating (BU) or ROI-updating (RU) frame. By applying the conventional video codec, the BU frame is compressed with low quality and high compression, while the ROI from the RU frame is compressed with high quality and low compression. The learning-based methods are applied to detect the ROI, blend the background and ROI, and enhance video quality. The experimental results show that our RCLC can reduce BD-rate by up to 32.55% for the ROI region compared to the H.265 video codec under a similar compression time at 1080p resolution.
CV Feb 11, 2021
Deep Photo Scan: Semi-Supervised Learning for dealing with the real-world degradation in Smartphone Photo Scanning
Man M. Ho, Jinjia Zhou
Physical photographs now can be conveniently scanned by smartphones and stored forever as a digital version, yet the scanned photos are not restored well. One solution is to train a supervised deep neural network on many digital photos and the corresponding scanned photos. However, it requires a high labor cost, leading to limited training data. Previous works create training pairs by simulating degradation using image processing techniques. Their synthetic images are formed with perfectly scanned photos in latent space. Even so, the real-world degradation in smartphone photo scanning remains unsolved since it is more complicated due to lens defocus, lighting conditions, losing details via printing. Besides, locally structural misalignment still occurs in data due to distorted shapes captured in a 3-D world, reducing restoration performance and the reliability of the quantitative evaluation. To solve these problems, we propose a semi-supervised Deep Photo Scan (DPScan). First, we present a way of producing real-world degradation and provide the DIV2K-SCAN dataset for smartphone-scanned photo restoration. Also, Local Alignment is proposed to reduce the minor misalignment remaining in data. Second, we simulate many different variants of the real-world degradation using low-level image transformation to gain a generalization in smartphone-scanned image properties, then train a degradation network to generalize all styles of degradation and provide pseudo-scanned photos for unscanned images as if they were scanned by a smartphone. Finally, we propose a Semi-Supervised Learning that allows our restoration network to be trained on both scanned and unscanned images, diversifying training image content. As a result, the proposed DPScan quantitatively and qualitatively outperforms its baseline architecture, state-of-the-art academic research, and industrial products in smartphone photo scanning.
IV Jan 24, 2021
Image Compression with Encoder-Decoder Matched Semantic Segmentation
Trinh Man Hoang, Jinjia Zhou, Yibo Fan
In recent years, layered image compression has been demonstrated to be a promising direction, which encodes a compact representation of the input image and applies an up-sampling network to reconstruct the image. To further improve the quality of the reconstructed image, some works transmit the semantic segment together with the compressed image data. Consequently, the compression ratio decreases because extra bits are required for transmitting the semantic segment. To solve this problem, we propose a new layered image compression framework with encoder-decoder matched semantic segmentation (EDMS). Following the semantic segmentation, a special convolutional neural network is used to enhance the inaccurate semantic segment. As a result, the accurate semantic segment can be obtained in the decoder without requiring extra bits. The experimental results show that the proposed EDMS framework achieves up to a 35.31% BD-rate reduction over the HEVC-based (BPG) codec, and 5% bitrate and 24% encoding time savings compared to the state-of-the-art semantic-based image codec.
IV Jan 22, 2021
B-DRRN: A Block Information Constrained Deep Recursive Residual Network for Video Compression Artifacts Reduction
Trinh Man Hoang, Jinjia Zhou
Although video compression ratios keep increasing, video coders such as H.264/AVC, H.265/HEVC, and H.266/VVC still suffer from compression artifacts. In this paper, we design a neural network, called B-DRRN (Deep Recursive Residual Network with Block information), to enhance the quality of the compressed frame by leveraging the block information. First, an extra network branch is designed for leveraging the block information of the coding unit (CU). Moreover, to avoid a great increase in the network size, a recursive residual structure and weight-sharing techniques are applied. We also construct a new large-scale dataset with 209,152 training samples. Experimental results show that the proposed B-DRRN can reduce BD-rate by 6.16% compared to the HEVC standard. By efficiently adding an extra network branch, this work can improve the performance of the main network without increasing the memory needed for storage.
CV Jul 21, 2020
Deep Preset: Blending and Retouching Photos with Color Style Transfer
Man M. Ho, Jinjia Zhou
End-users without knowledge of photography desire to beautify their photos to have a similar color style as a well-retouched reference. However, the definition of style in recent image style transfer works is inappropriate. They usually synthesize undesirable results due to transferring exact colors to the wrong destination. It becomes even worse in sensitive cases such as portraits. In this work, we concentrate on learning low-level image transformation, especially color-shifting methods, rather than mixing contextual features, then present a novel scheme to train color style transfer with ground-truth. Furthermore, we propose a color style transfer named Deep Preset. It is designed to 1) generalize the features representing the color transformation from content with natural colors to retouched reference, then blend it into the contextual features of content, 2) predict hyper-parameters (settings or preset) of the applied low-level color transformation methods, 3) stylize content to have a similar color style as reference. We script Lightroom, a powerful tool for editing photos, to generate 600,000 training samples using 1,200 images from the Flickr2K dataset and 500 user-generated presets with 69 settings. Experimental results show that our Deep Preset outperforms the previous works in color style transfer quantitatively and qualitatively.
CV Jun 13, 2020
Semantic-driven Colorization
Man M. Ho, Lu Zhang, Alexander Raake et al.
Recent colorization works implicitly predict the semantic information while learning to colorize black-and-white images. Consequently, the generated color is prone to overflowing object boundaries, and the semantic faults are invisible. When humans colorize a photo, our brains first detect and recognize the objects in it, then imagine their plausible colors based on many similar objects we have seen in real life, and finally colorize them, as described in the teaser. In this study, we simulate that human-like action to let our network first learn to understand the photo, then colorize it. Thus, our work can provide plausible colors at a semantic level. In addition, the semantic information of the learned model becomes understandable and interactive. We also prove that Instance Normalization is a missing ingredient for colorization, then re-design the inference flow of U-Net to have two streams of data, providing an appropriate way of normalizing the feature maps from the black-and-white image and its semantic map. As a result, our network can provide plausible colors that are competitive with typical colorization works for specific objects.