CVSY · Nov 13, 2025

VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System

arXiv:2511.10074v1 · 2 citations · h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses spectral inefficiency in wireless communication of multimodal data. The contribution is incremental, since it builds on existing vision-language models.

The paper proposes VLF-MSC, a system that transmits a single compact vision-language representation from which the receiver generates both an image and a text description. Compared with text-only and image-only baselines, it reports higher semantic accuracy at low SNR while using less bandwidth.

We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
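
The transmit path described in the abstract (encode to a single feature vector, send it over a noisy channel, condition both generators on the received vector) can be illustrated with a toy simulation. The sketch below is not the authors' implementation: the random vector standing in for the VLF, the AWGN channel model, the dimension of 512, and the function names `power_normalize`, `awgn_channel`, `cosine_similarity`, and `simulate` are all assumptions made for illustration. It only shows how channel noise at a given SNR degrades the similarity between the transmitted and received feature.

```python
import math
import random


def power_normalize(x):
    """Scale x so that its average per-symbol power equals 1."""
    power = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(power)
    return [v * scale for v in x]


def awgn_channel(x, snr_db, rng):
    """Add white Gaussian noise to a unit-power signal at the given SNR (dB).

    The AWGN model is an assumption; this excerpt does not state the
    channel model used in the paper.
    """
    noise_power = 10 ** (-snr_db / 10)  # signal power is 1 after normalization
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in x]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def simulate(snr_db, dim=512, seed=0):
    """Send a stand-in feature vector through the channel and compare.

    A random Gaussian vector is a hypothetical placeholder for the VLF
    that a pre-trained vision-language model would produce.
    """
    rng = random.Random(seed)
    vlf = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    tx = power_normalize(vlf)
    rx = awgn_channel(tx, snr_db, rng)
    return cosine_similarity(tx, rx)


if __name__ == "__main__":
    for snr_db in (20, 10, 0, -10):
        print(f"SNR {snr_db:>4} dB: cosine similarity {simulate(snr_db):.3f}")
```

In this toy setting the received vector stays close to the transmitted one at high SNR and drifts away as the SNR drops. In the system the abstract describes, the language model and the diffusion-based image generator would both be conditioned on the received feature, so a single transmission serves both modalities.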

