LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
This work addresses the need for better video understanding of daily activities, which is crucial for applications such as assistive technologies. The contribution is incremental, however, because it builds on existing LLVM frameworks and adds specialized data and training.
The paper tackles the difficulty that Large Language Vision Models (LLVMs) have with fine-grained details and complex human-object interactions (HOIs) in Activities of Daily Living (ADL). It proposes LLAVIDAL, a model that integrates videos, 3D skeletons, and HOIs and is trained with a Multimodal Progressive strategy, and it reports state-of-the-art performance on ADL benchmarks.
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at: https://adl-x.github.io/.
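The contrast the abstract draws between joint alignment and a staged curriculum can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the order of the stages nor whether earlier modalities stay active once introduced, so the order (video, skeleton, HOI), the cumulative schedule, and the names Stage, build_curriculum, joint_baseline, and run_curriculum are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage: which modalities are aligned, and for how long."""
    name: str
    modalities: Tuple[str, ...]
    epochs: int = 1


def build_curriculum(order: Sequence[str]) -> List[Stage]:
    """Hypothetical progressive schedule: add one modality per stage.

    Assumption: modalities accumulate, so later stages also see earlier ones.
    """
    stages: List[Stage] = []
    active: List[str] = []
    for modality in order:
        active.append(modality)
        stages.append(Stage(f"stage_{len(stages) + 1}_{modality}", tuple(active)))
    return stages


def joint_baseline(order: Sequence[str]) -> List[Stage]:
    """The 'simple joint alignment' baseline: every modality in a single stage."""
    return [Stage("joint", tuple(order))]


def run_curriculum(
    stages: Sequence[Stage],
    train_step: Callable[[Tuple[str, ...]], None],
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Call train_step once per epoch of each stage and log what was trained."""
    log = []
    for stage in stages:
        for _ in range(stage.epochs):
            train_step(stage.modalities)  # placeholder for a real optimization step
            log.append((stage.name, stage.modalities))
    return log


if __name__ == "__main__":
    # The modality order here is an assumption, not taken from the paper.
    for name, modalities in run_curriculum(
        build_curriculum(["video", "skeleton", "hoi"]),
        train_step=lambda modalities: None,
    ):
        print(name, modalities)
```

In a real setup, train_step would update the modality projectors and the language model on ADL-X batches; here it is a stub, so that only the scheduling logic is shown.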