PICD: Versatile Perceptual Image Compression with Diffusion Rendering
This work addresses high-quality compression of screen content (e.g., text) for applications such as remote desktop and document sharing, but the contribution is incremental, as it builds on existing perceptual compression and diffusion models.
The paper tackles perceptual image compression for screen content, where existing codecs often produce artifacts on text. It proposes PICD, a codec that encodes text and image separately and renders them into one image with a diffusion model conditioned at three levels. The results show that PICD surpasses existing perceptual codecs in text accuracy and perceptual quality, and that it also works effectively for natural images without text conditions.
Recently, perceptual image compression has advanced significantly, delivering high visual quality at low bitrates for natural images. However, for screen content, existing methods often produce noticeable artifacts when compressing text. To tackle this challenge, we propose versatile perceptual screen image compression with diffusion rendering (PICD), a codec that works well for both screen and natural images. More specifically, we propose a compression framework that encodes the text and the image separately and renders them into one image using a diffusion model. For this diffusion rendering, we integrate conditional information into the diffusion model at three distinct levels: (1) Domain level: we fine-tune the base diffusion model on screen content using text-content prompts. (2) Adaptor level: we develop an efficient adaptor that controls the diffusion model using the compressed image and text as input. (3) Instance level: we apply instance-wise guidance to further enhance the decoding process. Empirically, PICD surpasses existing perceptual codecs in terms of both text accuracy and perceptual quality. Additionally, without text conditions, our approach serves effectively as a perceptual codec for natural images.
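To make the "encode separately, render jointly" structure concrete, the following is a minimal toy sketch of the bitstream side only. It assumes the text has already been extracted from the screen image (e.g., by OCR) and substitutes zlib for the text coder and uniform quantization plus zlib for the learned image codec; the functions `encode`/`decode`, the header layout, and the `step` parameter are hypothetical illustrations, not PICD's actual coders. The diffusion rendering stage (domain-, adaptor-, and instance-level conditioning) is not modeled here.

```python
import struct
import zlib

import numpy as np

# Hypothetical header: text-stream length, image-stream length, height, width, quant step.
HEADER = ">IIHHB"


def encode(text: str, image: np.ndarray, step: int = 32) -> bytes:
    """Toy encoder: a lossless text stream plus a lossy image stream (stand-ins only)."""
    assert image.ndim == 2 and image.dtype == np.uint8, "expects a grayscale uint8 image"
    # Text stream: zlib is a lossless stand-in for whatever text coder is used.
    text_bits = zlib.compress(text.encode("utf-8"))
    # Image stream: uniform quantization + zlib stands in for a learned lossy codec.
    q = (image // step).astype(np.uint8)
    img_bits = zlib.compress(q.tobytes())
    h, w = image.shape
    header = struct.pack(HEADER, len(text_bits), len(img_bits), h, w, step)
    return header + text_bits + img_bits


def decode(stream: bytes) -> tuple[str, np.ndarray]:
    """Toy decoder: recovers the text and a coarse image that a renderer would consume."""
    hdr = struct.calcsize(HEADER)
    n_text, n_img, h, w, step = struct.unpack(HEADER, stream[:hdr])
    text = zlib.decompress(stream[hdr:hdr + n_text]).decode("utf-8")
    raw = zlib.decompress(stream[hdr + n_text:hdr + n_text + n_img])
    q = np.frombuffer(raw, dtype=np.uint8).reshape(h, w)
    # Reconstruct each pixel at the centre of its quantization bin.
    coarse = q.astype(np.float32) * step + step / 2
    return text, coarse


if __name__ == "__main__":
    img = (np.arange(64 * 64) % 256).astype(np.uint8).reshape(64, 64)
    stream = encode("Hello, PICD", img)
    text, coarse = decode(stream)
    print(text, coarse.shape, len(stream), "bytes")
```

In this sketch the text stream is lossless, so the decoded text matches the input exactly no matter how coarse the image stream is; per the abstract, PICD's adaptor would then take the compressed image and the text as conditions for the diffusion renderer. Passing an empty string mirrors the natural-image mode, where no text condition is present.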