CV | LG | Dec 31, 2024

OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models

arXiv:2501.00432v1 | 1 citation | h-index: 21 | ICASSP
Originality: Incremental advance
AI Analysis

The paper addresses the need for adaptable surveillance systems by enabling recognition of diverse and unpredictable interactions. The advance is incremental, since it builds on existing cross-modal methods.

The paper tackles the problem of recognizing human-to-human interactions in real-world settings. It proposes an open vocabulary framework that uses large language models to generate textual descriptions of interactions, and the authors report that it outperforms traditional fixed-vocabulary systems.

Understanding human-to-human interactions, especially in contexts like public security surveillance, is critical for monitoring and maintaining safety. Traditional activity recognition systems are limited by fixed vocabularies, predefined labels, and rigid interaction categories that often rely on choreographed videos and overlook concurrent interactive groups. These limitations make such systems less adaptable to real-world scenarios, where interactions are diverse and unpredictable. In this paper, we propose an open vocabulary human-to-human interaction recognition (OV-HHIR) framework that leverages large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings without being confined to a fixed vocabulary. Additionally, we create a comprehensive, large-scale human-to-human interaction dataset by standardizing and combining existing public human interaction datasets into a unified benchmark. Extensive experiments demonstrate that our method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models for video understanding, setting the stage for more intelligent and adaptable visual understanding systems in surveillance and beyond.

