Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
This work targets dense perception in computer vision and offers a faster, more accurate approach, though the contribution is incremental because it builds on existing image editing models.
The paper addresses dense perception tasks (depth, normal, and matting) by proposing Edit2Perceive, a unified diffusion framework that adapts image editing models and achieves state-of-the-art results with up to 10x faster runtime.
Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to 10x faster runtime, and training requires only relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.
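To make the idea of a pixel-space consistency loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a rectified-flow parameterization (x_t = (1 - t) * x_0 + t * noise, with the network predicting the velocity noise - x_0), so that a clean latent can be estimated from any intermediate state in one step. The names `one_step_estimate`, `toy_decode`, and `pixel_consistency_loss` are hypothetical, and `toy_decode` is only a stand-in for a real latent-to-pixel decoder.

```python
import random


def one_step_estimate(x_t, v_pred, t):
    """Estimate the clean latent from an intermediate state.

    Assumes the rectified-flow relation x_0 = x_t - t * v (an assumption,
    not a detail confirmed by the abstract).
    """
    return [x - t * v for x, v in zip(x_t, v_pred)]


def toy_decode(latent):
    """Hypothetical stand-in for a decoder: repeat each latent value twice."""
    return [z for z in latent for _ in range(2)]


def pixel_consistency_loss(x_t, v_pred, t, target_pixels, decode=toy_decode):
    """Mean squared error, measured after decoding to pixel space."""
    pred_pixels = decode(one_step_estimate(x_t, v_pred, t))
    squared = [(p - q) ** 2 for p, q in zip(pred_pixels, target_pixels)]
    return sum(squared) / len(squared)


# Toy data: a "clean" latent, a noise sample, and one intermediate state.
random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(16)]
noise = [random.gauss(0.0, 1.0) for _ in range(16)]
t = 0.7
x_t = [(1 - t) * a + t * n for a, n in zip(x0, noise)]
v_true = [n - a for a, n in zip(x0, noise)]
target = toy_decode(x0)

# A perfect velocity prediction recovers the target; a biased one does not.
loss_exact = pixel_consistency_loss(x_t, v_true, t, target)
loss_off = pixel_consistency_loss(x_t, [v + 0.1 for v in v_true], t, target)
```

Under these assumptions the loss can be evaluated at any intermediate denoising state `t`, which is one plausible reading of "across intermediate denoising states"; at `t = 1` the same one-step estimate corresponds to a single-step deterministic prediction from pure noise.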