CV, Oct 29, 2025

Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi

arXiv:2510.25070v1, h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses the problem of recognizing new objects and contexts without labeled data in real-world scene analysis. It is an incremental advance that combines existing pre-trained models.

This work tackles zero-shot scene understanding in real-world environments by proposing a vision-language integration framework. It reports up to an 18% improvement in top-1 accuracy over state-of-the-art zero-shot models, alongside gains in object recognition, activity detection, and scene captioning.

Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
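
The shared-space idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors below are made-up stand-ins for the outputs of a visual encoder and a text encoder, the prompt strings and function names are hypothetical, and the temperature of 0.07 is an assumed value. The fusion and reasoning layers described in the abstract are omitted entirely. The sketch only shows how an image can be assigned to an unseen category by comparing its embedding with embeddings of text prompts, and how top-1 accuracy is then computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.07):
    """Turn similarity scores into probabilities (temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_vec, prompt_vecs):
    """Pick the text prompt whose embedding is closest to the image embedding."""
    labels = list(prompt_vecs)
    sims = [cosine(image_vec, prompt_vecs[label]) for label in labels]
    probs = softmax(sims)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


def top1_accuracy(predictions, targets):
    """Fraction of samples whose highest-scoring label matches the target."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical embeddings: in a real system these would come from
# pre-trained visual and text encoders projecting into the same space.
PROMPT_VECS = {
    "a photo of a kitchen": [0.9, 0.1, 0.0],
    "a photo of a street": [0.1, 0.9, 0.1],
    "a photo of a beach": [0.0, 0.2, 0.9],
}
IMAGE_VECS = [
    [0.8, 0.2, 0.1],
    [0.2, 0.7, 0.2],
    [0.1, 0.1, 0.95],
]
TARGETS = [
    "a photo of a kitchen",
    "a photo of a street",
    "a photo of a beach",
]

predictions = [zero_shot_classify(vec, PROMPT_VECS)[0] for vec in IMAGE_VECS]
accuracy = top1_accuracy(predictions, TARGETS)
print(predictions, accuracy)
```

No category-specific training happens here: adding a new class only requires a new text prompt and its embedding, which is the sense in which natural language acts as a bridge to unseen categories.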
