CV · Jun 3

Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

Shuwen Yu, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

arXiv:2606.04385 · 68.0 · Has Code

Predicted impact: top 46% in CV · last 90 days · Originality: Incremental advance

AI Analysis

For practitioners using heterogeneous foundation models, GPUA provides a lightweight, task-agnostic method to bridge VFMs and VLMs, enhancing cross-model compatibility.

GPUA aligns vision-only foundation models (VFMs) with vision-language models (VLMs) via an orthogonal mapping that preserves geometry, enabling improved zero-shot recognition and segmentation without labels or parameter updates.

Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at https://github.com/Yuteam14/GPUA
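The abstract's claim that an orthogonal mapping preserves geometry can be checked numerically. The sketch below is not GPUA's alignment procedure, which the abstract does not spell out. It only illustrates the underlying property: any orthogonal matrix leaves norms and pairwise cosine similarities of the mapped features unchanged. The random features stand in for VFM embeddings, and all names (`gram_schmidt`, `apply_map`, `W`, `feats`) are hypothetical.

```python
import math
import random


def gram_schmidt(rows):
    """Orthonormalize linearly independent row vectors (modified Gram-Schmidt)."""
    basis = []
    for v in rows:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis


def apply_map(W, x):
    """Return the matrix-vector product W @ x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def norm(a):
    return math.sqrt(sum(ai * ai for ai in a))


def cosine(a, b):
    return sum(ai * bi for ai, bi in zip(a, b)) / (norm(a) * norm(b))


rng = random.Random(0)
d, n = 8, 5  # toy feature dimension and sample count (assumed values)

# A random orthogonal matrix standing in for a learned mapping.
W = gram_schmidt([[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)])

# Placeholder "VFM features" and their images under the mapping.
feats = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
mapped = [apply_map(W, f) for f in feats]

# Largest change in any pairwise cosine similarity after mapping.
max_cos_drift = max(
    abs(cosine(feats[i], feats[j]) - cosine(mapped[i], mapped[j]))
    for i in range(n)
    for j in range(i + 1, n)
)

# Largest change in any feature norm after mapping.
max_norm_drift = max(abs(norm(f) - norm(m)) for f, m in zip(feats, mapped))
```

Both drift values stay at floating-point rounding level, which is why an orthogonal constraint lets the source space be rotated toward a target space without distorting its internal neighborhood structure.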
