Region-Based Representations Revisited
This work revives region-based methods for computer vision, offering a compact and query-friendly representation. The contribution is incremental, since it builds on existing segmenters and representations rather than introducing new ones.
The paper asks whether region-based representations are effective for recognition. It combines class-agnostic segmenters such as SAM with unsupervised representations such as DINOv2 and reports competitive performance on tasks such as semantic segmentation and object-based image retrieval.
We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel- and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2 and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The compactness of the representation also makes it well suited to video analysis and other problems requiring inference across many images.
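To make the pipeline concrete, here is a minimal sketch of how masks and patch features might be combined into per-region vectors and passed to a linear decoder. It rests on assumptions not stated in the text above: that features are mean-pooled inside each mask, and that masks and features share one grid resolution. The function names `pool_region_features` and `linear_scores` are hypothetical, and the inputs are synthetic stand-ins for SAM masks and DINOv2 patch features, not outputs of either model.

```python
from typing import List, Sequence

Vector = List[float]


def pool_region_features(
    patch_features: Sequence[Sequence[Vector]],
    masks: Sequence[Sequence[Sequence[bool]]],
) -> List[Vector]:
    """Average the patch features that fall inside each region mask.

    patch_features: H x W grid of D-dimensional vectors (synthetic
        stand-in for patch features from a model such as DINOv2).
    masks: R binary masks, each H x W, on the same grid as the features
        (synthetic stand-in for class-agnostic masks such as SAM's).
    Returns R vectors of length D; an empty mask yields a zero vector.
    Mean pooling is an assumption of this sketch.
    """
    dim = len(patch_features[0][0])
    pooled: List[Vector] = []
    for mask in masks:
        total = [0.0] * dim
        count = 0
        for feat_row, mask_row in zip(patch_features, mask):
            for feat, inside in zip(feat_row, mask_row):
                if inside:
                    count += 1
                    for d in range(dim):
                        total[d] += feat[d]
        pooled.append([t / count for t in total] if count else total)
    return pooled


def linear_scores(
    region: Vector, weights: Sequence[Vector], bias: Vector
) -> Vector:
    """Apply a linear decoder: one score per class for a region vector."""
    return [
        sum(w * x for w, x in zip(row, region)) + b
        for row, b in zip(weights, bias)
    ]


# Toy 2 x 2 feature grid with 2-dimensional features.
features = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
# Two regions: the top row and the bottom row of the grid.
masks = [
    [[True, True], [False, False]],
    [[False, False], [True, True]],
]
regions = pool_region_features(features, masks)

# Hypothetical two-class decoder with identity weights and zero bias.
scores = [linear_scores(r, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]) for r in regions]
```

In this sketch an image is reduced from H x W patch vectors to R region vectors, which illustrates the kind of compactness the abstract describes: once the region vectors are stored, a new query only requires fitting or applying a small decoder over them rather than re-running the segmenter or the feature extractor.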