CVAIMay 16

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

arXiv:2605.1686469.6Has Code
AI Analysis

This work addresses the challenge of effectively fusing features from different VFMs for dense prediction tasks, which is important for leveraging existing models to improve downstream applications.

The paper proposes a metric-guided feature fusion method for combining complementary features from different visual foundation models (VFMs) to improve dense prediction tasks. The method achieves consistent performance gains across multiple tasks, with better object-level semantics and more accurate boundaries.

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at {https://github.com/gyc-code/metric-guided-fusion}.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes