CV · CL · Jun 16, 2025

Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images

arXiv:2506.13458v1
h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work addresses activity recognition for indexing, safety, and assistive applications, but it is incremental as it applies an existing method to a new dataset.

The paper tackles the problem of recognizing human activities in still images, where motion cues are unavailable. Fine-tuning multimodal CLIP achieves 76% accuracy, compared with 41% for CNNs trained from scratch.

Recognising human activity in a single photo enables indexing, safety and assistive applications, yet a still image lacks motion cues. On 285 MSCOCO images labelled as walking, running, sitting, or standing, CNNs trained from scratch scored 41% accuracy. Fine-tuning multimodal CLIP raised this to 76%, demonstrating that contrastive vision-language pre-training decisively improves still-image action recognition in real-world deployments.
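To illustrate how a contrastively pre-trained vision-language model can score an image against the four activity labels, here is a minimal sketch of CLIP-style classification: cosine similarity between one image embedding and one text embedding per label, scaled and passed through a softmax. It is not the paper's fine-tuning procedure, which this summary does not describe. The embeddings are hand-made toy vectors, and the names `clip_style_probs`, `predict`, and the `logit_scale` value of 100 are assumptions made for illustration; in practice the embeddings would come from the model's image and text encoders.

```python
import math

# The four activity labels named in the summary.
LABELS = ["walking", "running", "sitting", "standing"]


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label text.

    logit_scale is an assumed temperature-like constant, not a value from the paper.
    """
    img = normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Subtract the maximum logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(image_emb, text_embs):
    """Return the highest-probability label and the full probability list."""
    probs = clip_style_probs(image_emb, text_embs)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Toy embeddings (hypothetical): one row per label, in LABELS order.
TOY_TEXT_EMBS = [
    [1.0, 0.0, 0.0, 0.0],  # walking
    [0.8, 0.6, 0.0, 0.0],  # running
    [0.0, 0.0, 1.0, 0.0],  # sitting
    [0.0, 0.2, 0.6, 0.8],  # standing
]
# Toy image embedding (hypothetical) that lies closest to the "sitting" row.
TOY_IMAGE_EMB = [0.1, 0.0, 0.9, 0.2]

label, probs = predict(TOY_IMAGE_EMB, TOY_TEXT_EMBS)
print(label)
```

With these toy vectors the image lies closest to the "sitting" text embedding, so that label is selected. Fine-tuning would adjust the encoders (or a small head on top of them) so that real photos land near the correct label text.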

