CV · AI · Oct 15, 2025

Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity

arXiv:2510.13364v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of optimizing zero-shot classification for visually similar categories, such as postures, in data-scarce conditions. The advance is incremental, since the work focuses on the effects of prompt design.

The study examines how prompt design affects zero-shot classification of human postures by Vision-Language Models under data scarcity. It finds that simpler prompts perform better for the top models: MetaCLIP 2's accuracy drops from 68.8% to 55.1% when descriptive detail is added.

Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, was evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance (for instance, MetaCLIP 2's multi-class accuracy drops from 68.8% to 55.1%), a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.
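
To make the shared-space idea concrete, here is a minimal sketch of zero-shot classification by cosine similarity between one image embedding and one text embedding per class prompt. It is not the paper's pipeline: the three-dimensional vectors are toy values standing in for real encoder outputs, and the names `zero_shot_classify`, `l2_normalize`, and the `temperature` value are assumptions made for illustration. In practice the embeddings would come from a VLM's image and text encoders, and changing the prompt wording changes only the text embeddings.

```python
import math

# The three posture classes studied in the paper.
LABELS = ["sitting", "standing", "walking/running"]


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return (predicted class index, class probabilities).

    Each class is scored by the cosine similarity between the image
    embedding and that class's prompt embedding; a softmax over the
    temperature-scaled scores gives probabilities.
    The temperature value is an assumption, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_embs
    ]
    scaled = [s / temperature for s in sims]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs


# Toy embeddings (hypothetical): one image, one prompt vector per class.
image_emb = [0.9, 0.1, 0.2]
text_embs = [
    [1.0, 0.0, 0.0],  # stands in for a "sitting" prompt
    [0.0, 1.0, 0.0],  # stands in for a "standing" prompt
    [0.0, 0.0, 1.0],  # stands in for a "walking/running" prompt
]

pred, probs = zero_shot_classify(image_emb, text_embs)
print(LABELS[pred])
```

Under this scheme, comparing prompt tiers amounts to re-encoding the class prompts at each level of detail and re-running the same scoring over the dataset.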
