LG | Jun 10, 2025

When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

arXiv:2506.08645v2 | 2 citations | h-index: 15 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of integrating distinct embeddings to improve performance on multi-modal and unimodal tasks. The advance appears incremental because it builds on existing kernel and fusion techniques.

The paper tackles the problem of fusing complementary embeddings from different models. It proposes multiplying their kernels, realized via the Kronecker product of the embedding feature maps. In experiments bridging cross-modal and unimodal models, this enhances modality-specific performance while preserving cross-modal alignment.

State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
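The link between kernel multiplication and the Kronecker product can be checked numerically. The sketch below is illustrative only and is not taken from the project repository: it uses random stand-in embeddings with linear kernels, and every name in it (`kron_features`, `phi1`, `phi2`, `m`) is hypothetical. The first part verifies that the Gram matrix of row-wise Kronecker features equals the elementwise product of the two kernels. The second part shows one generic random-projection approximation that avoids forming the full Kronecker space; the exact construction used by RP-KrossFuse may differ.

```python
import numpy as np

rng = np.random.default_rng(0)


def kron_features(a, b):
    """Row-wise Kronecker product: row i equals np.kron(a[i], b[i])."""
    return np.einsum("ni,nj->nij", a, b).reshape(a.shape[0], -1)


# Stand-in embeddings of the same n inputs from two hypothetical models.
n, d1, d2 = 5, 8, 6
phi1 = rng.normal(size=(n, d1))
phi2 = rng.normal(size=(n, d2))

# Linear kernel of each embedding, and their elementwise product.
k1 = phi1 @ phi1.T
k2 = phi2 @ phi2.T
k_prod = k1 * k2

# The Gram matrix of the Kronecker features reproduces the product kernel,
# because <a (x) b, c (x) d> = <a, c> * <b, d>.
fused = kron_features(phi1, phi2)  # shape (n, d1 * d2)
k_fused = fused @ fused.T
assert np.allclose(k_fused, k_prod)

# Assumed approximation scheme (not necessarily the paper's): project each
# embedding with its own Gaussian matrix and multiply the results elementwise.
# Each of the m output features is an unbiased estimate of the product kernel,
# and the d1 * d2 dimensional vector is never materialized.
m = 4096
w1 = rng.normal(size=(d1, m))
w2 = rng.normal(size=(d2, m))
z = (phi1 @ w1) * (phi2 @ w2) / np.sqrt(m)  # shape (n, m)

k_approx = z @ z.T
rel_err = np.linalg.norm(k_approx - k_prod) / np.linalg.norm(k_prod)
print(f"relative Frobenius error with m={m}: {rel_err:.3f}")
```

The exact features grow as d1 * d2, which is why an approximation matters for real embedding sizes; in this sketch the approximation error shrinks roughly as 1/sqrt(m) as the projected dimension m increases.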

