CV | Sep 26, 2024

EgoLM: Multi-Modal Language Model of Egocentric Motions

arXiv:2409.18127v1 | 29 citations | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the need for contextual AI in wearable devices by improving egomotion tracking and understanding, though it appears incremental as it builds on existing LLM methods for a specific domain.

The authors address the problem of learning egocentric motions from multi-modal inputs by introducing EgoLM, a framework that uses large language models to model the joint distribution of motions and natural language. They validate it as a generalist model on a large-scale dataset.

With the prevalence of wearable devices, learning egocentric motions becomes essential for developing contextual AI. In this work, we present EgoLM, a versatile framework that tracks and understands egocentric motions from multi-modal inputs, e.g., egocentric videos and motion sensors. EgoLM exploits rich contexts to disambiguate egomotion tracking and understanding, which are ill-posed under single-modality conditions. To facilitate the versatile and multi-modal framework, our key insight is to model the joint distribution of egocentric motions and natural language using large language models (LLMs). Multi-modal sensor inputs are encoded and projected to the joint latent space of language models, and used to prompt motion generation or text generation for egomotion tracking or understanding, respectively. Extensive experiments on a large-scale multi-modal human motion dataset validate the effectiveness of EgoLM as a generalist model for universal egocentric learning.
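The prompting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding width, the feature sizes, the random matrices standing in for learned projection layers, and the names `make_linear`, `project`, and `build_prompt` are all hypothetical. It only shows the general idea of projecting per-modality features into one shared embedding space and using them as a prefix that conditions either motion generation (tracking) or text generation (understanding).

```python
import random

EMBED_DIM = 8  # hypothetical language-model embedding width (assumption)


def make_linear(in_dim, out_dim, seed):
    """Random weight matrix standing in for a learned projection layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, feature):
    """Map one encoder feature vector into the shared embedding space."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def build_prompt(video_feats, imu_feats, w_video, w_imu, task):
    """Assemble the prefix of embeddings that conditions generation.

    task is "tracking" (generate motion) or "understanding" (generate text).
    """
    if task not in ("tracking", "understanding"):
        raise ValueError(f"unknown task: {task}")
    # Each modality has its own projection but lands in the same space.
    prefix = [project(w_video, f) for f in video_feats]
    prefix += [project(w_imu, f) for f in imu_feats]
    target_modality = "motion" if task == "tracking" else "text"
    return {"prefix": prefix, "target_modality": target_modality}


# Toy inputs (made up): 2 video frames with 4-dim features,
# 3 motion-sensor readings with 6-dim features.
video_feats = [[0.1 * i] * 4 for i in range(2)]
imu_feats = [[0.2 * i] * 6 for i in range(3)]
w_video = make_linear(4, EMBED_DIM, seed=0)
w_imu = make_linear(6, EMBED_DIM, seed=1)

prompt = build_prompt(video_feats, imu_feats, w_video, w_imu, task="tracking")
print(len(prompt["prefix"]), len(prompt["prefix"][0]), prompt["target_modality"])
# prints: 5 8 motion
```

In this sketch the same prefix serves both tasks, and only the requested output modality changes. That mirrors the abstract's description of one model prompted for either tracking or understanding.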

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
