LG · Sep 12, 2025

Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition

Ilker Demirel, Karan Thakkar, Benjamin Elizalde, Miquel Espi Marques, Shirley Ren, Jaya Narain

arXiv:2509.10729v1 · 11.48 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the challenge of integrating complementary sensor information for activity recognition, particularly when aligned training data is limited. It appears incremental, however, because it applies existing LLMs to a new fusion task.

The paper tackles the problem of multimodal activity recognition from audio and motion sensor data by using large language models (LLMs) for late fusion, achieving significantly above-chance zero- and one-shot F1-scores for 12-class classification on a curated Ego4D subset without task-specific training.

Sensor data streams provide valuable information about activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data. We curated a subset of data for diverse activity recognition across contexts (e.g., household activities, sports) from the Ego4D dataset. The evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion from modality-specific models can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Additionally, LLM-based fusion can enable model deployment without requiring additional memory and computation for targeted application-specific multimodal models.
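
As a rough illustration of the late-fusion idea described above, the sketch below turns the text outputs of separate audio and motion models into a single prompt that an LLM could answer zero-shot. It is only a sketch under stated assumptions: the function names, the label set, the scores, and the prompt wording are all hypothetical, and the paper's actual prompts, class list, and modality-specific models are not given on this page. No LLM is called; the reply parser simply matches a candidate label in whatever text a model returns.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical label set for illustration; the paper's 12 classes are not listed here.
ACTIVITIES: List[str] = ["cooking", "vacuuming", "playing basketball"]


def build_fusion_prompt(
    modality_outputs: Dict[str, List[Tuple[str, float]]],
    activities: List[str],
) -> str:
    """Combine per-modality (label, score) predictions into one text prompt."""
    lines = ["Per-modality model outputs for one time window:"]
    # Sort modalities by name so the prompt is deterministic.
    for modality, predictions in sorted(modality_outputs.items()):
        # List each modality's predictions from highest to lowest score.
        ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
        summary = ", ".join(f"{label} ({score:.2f})" for label, score in ranked)
        lines.append(f"- {modality}: {summary}")
    lines.append("Candidate activities: " + ", ".join(activities))
    lines.append("Answer with exactly one candidate activity.")
    return "\n".join(lines)


def parse_activity(llm_reply: str, activities: List[str]) -> Optional[str]:
    """Return the first candidate activity mentioned in the reply, or None."""
    reply = llm_reply.lower()
    for activity in activities:
        if activity.lower() in reply:
            return activity
    return None
```

Because fusion happens at the text level in this sketch, the audio and motion models never need a shared embedding space; one-shot use would amount to prepending a single worked example to the same prompt.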
