CV · IV · Dec 24, 2024

Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task

arXiv:2412.18158v1
h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the need for versatile image compression that serves both human and machine vision, which would reduce retraining cost and complexity. It appears incremental, however, because it builds on existing learned compression and multimodal models.

The paper tackles the problem of learned image compression methods being specialized for either human visual perception or machine vision tasks, limiting versatility and requiring retraining for new applications. It introduces DISCOVER, a semantics disentanglement and composition codec that simultaneously enhances both human-eye perception and machine vision tasks, with experimental evaluations showing superior performance in fulfilling these dual objectives.

While learned image compression methods have achieved impressive results in either human visual perception or machine vision tasks, they are often specialized for only one domain. This drawback limits their versatility and generalizability across scenarios and also requires retraining to adapt to new applications, a process that adds significant complexity and cost in real-world scenarios. In this study, we introduce an innovative semantics DISentanglement and COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye perception and machine vision tasks. The approach derives a set of labels per task through multimodal large models, to which grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side. At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks. Extensive experimental evaluations substantiate the robustness and effectiveness of DISCOVER, demonstrating superior performance in fulfilling the dual objectives of human and machine vision requirements.
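
The data flow described in the abstract (labels per task, then localization, then composition at the decoder) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`Component`, `disentangle`, `compose`, the `toy_*` stand-ins) is hypothetical, as are the box format and the de-duplication step. The real multimodal, grounding, and generative models are replaced by plain callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Component:
    """One localized image component: a semantic label plus its box."""
    label: str
    box: Box


def disentangle(
    image: Any,
    tasks: List[str],
    propose_labels: Callable[[Any, str], List[str]],
    ground: Callable[[Any, str], List[Box]],
) -> Dict[str, List[Component]]:
    """Encoder side: derive labels per task, then localize each label."""
    components: Dict[str, List[Component]] = {}
    for task in tasks:
        labels = propose_labels(image, task)  # stand-in for a multimodal model
        components[task] = [
            Component(label, box)
            for label in labels
            for box in ground(image, label)  # stand-in for a grounding model
        ]
    return components


def compose(
    components: Dict[str, List[Component]],
    reconstruct: Callable[[List[Component]], Any],
) -> Any:
    """Decoder side: merge components across tasks and reconstruct.

    De-duplicating by (label, box) is an assumption of this sketch.
    """
    seen = set()
    unique: List[Component] = []
    for task_components in components.values():
        for comp in task_components:
            key = (comp.label, comp.box)
            if key not in seen:
                seen.add(key)
                unique.append(comp)
    return reconstruct(unique)  # stand-in for a generative-prior decoder


# Toy stand-ins so the sketch runs without any model.
def toy_labels(image: Any, task: str) -> List[str]:
    return {"detection": ["dog", "ball"], "perception": ["dog"]}[task]


def toy_ground(image: Any, label: str) -> List[Box]:
    return {"dog": [(0, 0, 4, 4)], "ball": [(5, 5, 7, 7)]}[label]


def toy_reconstruct(components: List[Component]) -> List[str]:
    return sorted(comp.label for comp in components)


parts = disentangle(None, ["detection", "perception"], toy_labels, toy_ground)
result = compose(parts, toy_reconstruct)
```

Passing the models in as callables keeps the sketch independent of any particular library. Swapping in real models would only change the three stand-in functions.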

