From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
This work offers a more effective and efficient approach to text-based segmentation by repurposing generative models for discriminative tasks, benefiting applications that require flexible object delineation.
RLFSeg uses Rectified Flow to map images directly to segmentation masks, avoiding the noise-denoise process of diffusion models. It achieves superior zero-shot performance (e.g., a 5.1% mIoU gain over diffusion-based methods on unseen datasets) and high accuracy with single-step inference.
Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering greater flexibility and a broader application scope than traditional fixed-category segmentation. Recent studies have shown that diffusion models (e.g., Stable Diffusion) provide rich multimodal semantic features, which has prompted work that uses diffusion models as feature extractors for segmentation. Such methods, however, inherit generative properties of diffusion models that are harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn a direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and from the need to tune the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially in zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even with a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with no modification to the model structure, revealing promising application potential and significant research value.
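To make the idea of a direct latent-space mapping concrete, the following is a minimal sketch of the generic Rectified Flow formulation: a straight path between a source latent and a target latent, a velocity-regression loss, and a single Euler step for inference. It is an illustration under stated assumptions, not the RLFSeg implementation. The function names, the toy four-dimensional latents, and the `oracle_velocity` stand-in for a trained network are all hypothetical, and the single step shown is a plain Euler step rather than the Adaptive One-Step Sampling strategy, whose details are not given here.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (image latent) to x1 (mask latent)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and the target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

def one_step_sample(velocity_fn, x0):
    """Single Euler step over the whole interval [0, 1]: x1 ≈ x0 + v(x0, 0)."""
    v = velocity_fn(x0, 0.0)
    return [a + b for a, b in zip(x0, v)]

# Toy latents (assumed values, for illustration only).
image_latent = [0.2, -1.0, 0.5, 3.0]
mask_latent = [1.0, -1.0, 1.0, -1.0]

def oracle_velocity(xt, t):
    # Hypothetical stand-in for a trained network: returns the exact
    # straight-line velocity for this one (image, mask) pair.
    return target_velocity(image_latent, mask_latent)

loss = flow_matching_loss(oracle_velocity, image_latent, mask_latent, t=random.random())
predicted_mask_latent = one_step_sample(oracle_velocity, image_latent)
```

Because the path is a straight line, a perfectly learned velocity field reaches the target latent in one step with no noise injection and no time-step schedule to tune, which is the property the abstract attributes to replacing the noise-denoise process. In practice the velocity would come from a pretrained generative backbone conditioned on the text prompt, and the predicted latent would be decoded into a mask.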