SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living
This work addresses limited generalization in ADL video analysis, a problem relevant to applications such as healthcare and assistive technologies. It is an incremental advance that combines existing skeleton-based and language-based methods.
The paper addresses the difficulty vision-language models have with the challenges of Activities of Daily Living (ADL) videos. It introduces SKI models, which integrate 3D skeletons into the vision-language embedding space and improve performance on zero-shot action recognition and video caption generation across three ADL datasets.
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. Because these approaches are not integrated with language, their ability to generalize to unseen action classes is limited. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision-Language Models (VLMs) and Large Vision-Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
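To make the idea of a shared embedding space concrete, the sketch below shows CLIP-style zero-shot action recognition: a video embedding is compared by cosine similarity against text embeddings of class prompts, and the closest prompt wins. It also includes a hypothetical training-time alignment term that pulls video embeddings toward skeleton embeddings, which is one plausible way skeleton information could be used in training and dropped at inference. The abstract does not specify the training objective, so `alignment_loss`, the function names, and the toy vectors are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_predict(video_emb, text_embs):
    """Pick the class prompt whose embedding is closest to the video.

    video_emb: shape (d,)   embedding of one video
    text_embs: shape (k, d) embeddings of k class prompts
    Returns (predicted class index, cosine similarities of shape (k,)).
    Only the video and text embeddings are needed here; no skeleton input.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_embs)
    sims = t @ v
    return int(np.argmax(sims)), sims


def alignment_loss(video_embs, skeleton_embs):
    """Hypothetical training-time term (an assumption, not from the paper):
    mean of (1 - cosine similarity) between paired video and skeleton
    embeddings, both of shape (n, d)."""
    v = l2_normalize(video_embs)
    s = l2_normalize(skeleton_embs)
    return float(np.mean(1.0 - np.sum(v * s, axis=-1)))
```

With toy inputs, `zero_shot_predict(np.array([1.0, 0.0]), np.array([[0.0, 1.0], [2.0, 0.1]]))` selects the second prompt, since it points in nearly the same direction as the video embedding. Because prediction depends only on video and text embeddings, a model trained with some skeleton-based term can be run without skeleton data, which matches the inference-time property the abstract describes.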