CV AI LG · Oct 13, 2025

Topological Alignment of Shared Vision-Language Embedding Space

arXiv:2510.10889v1 · 2 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This work targets the English bias of multilingual vision-language models, a concern for both researchers and practitioners, and offers a general method for topological alignment in representation learning. The advance is incremental, since it builds on existing multilingual extensions.

The paper tackles biased cross-modal alignment in multilingual vision-language models by introducing ToMCLIP, a topology-aware framework that uses persistent homology to align embedding spaces. The authors report enhanced structural coherence, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO.

Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have narrowed this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates the persistence diagram, with theoretical error bounds, using a graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
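
To make the idea of comparing embedding spaces through persistent homology concrete, here is a minimal sketch. It is not the ToMCLIP loss: the abstract does not give the exact formulation, so this is restricted to 0-dimensional homology, is not differentiable, and omits the graph sparsification step. It relies on a standard fact: for a Vietoris-Rips filtration, the death times of connected components equal the edge lengths of a minimum spanning tree. The function names `h0_death_times` and `h0_alignment_gap` are hypothetical and chosen only for illustration.

```python
import numpy as np


def h0_death_times(points: np.ndarray) -> np.ndarray:
    """Sorted 0-dimensional persistence death times of a point cloud.

    For a Vietoris-Rips filtration these are the edge lengths of a minimum
    spanning tree, built here with Prim's algorithm on the dense Euclidean
    distance matrix. Expects an array of shape (n, d) with n >= 2.
    """
    n = points.shape[0]
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()  # cheapest known edge from the tree to each point
    best[0] = np.inf

    deaths = []
    for _ in range(n - 1):
        j = int(np.argmin(best))      # closest point not yet in the tree
        deaths.append(best[j])        # a component merges at this scale
        in_tree[j] = True
        best = np.minimum(best, dist[j])
        best[in_tree] = np.inf        # never reselect points already in the tree
    return np.sort(np.array(deaths))


def h0_alignment_gap(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Mean squared difference between the sorted H0 death times of two
    equally sized point clouds (an illustrative stand-in for a topological
    alignment penalty, assumed here rather than taken from the paper)."""
    deaths_a = h0_death_times(emb_a)
    deaths_b = h0_death_times(emb_b)
    return float(np.mean((deaths_a - deaths_b) ** 2))
```

In this sketch, `emb_a` and `emb_b` would be, for example, English and non-English text embeddings of the same captions. Because every 0-dimensional feature is born at scale zero, matching the sorted death times is the optimal one-to-one matching between the two diagrams, so the gap is small when the two clouds cluster at similar scales, regardless of where individual points sit.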
