CV · Nov 21, 2025

The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation

Victor Li, Naveenraj Kamalakannan, Avinash Parnandi, Heidi Schambra, Carlos Fernandez-Granda

arXiv:2511.17727v1 · 1 citation

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of automating stroke rehabilitation assessment for clinicians and patients. It is incremental: it highlights current limitations and suggests future opportunities without achieving major breakthroughs.

The study applied vision-language models (VLMs) to automatically quantify rehabilitation dose and impairment from videos of stroke rehabilitation. It found that current VLMs lack fine-grained motion understanding: dose estimates were comparable to a baseline without visual information, and impairment predictions were unreliable. The models nonetheless showed promise for high-level activity classification, and they approximated dose counts within 25% of ground truth for mildly impaired and healthy participants.

Vision-language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We formulate these problems as motion-identification tasks, which can be addressed using VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or finetuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.
