CL, AI, CV | Aug 7, 2025

Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering

arXiv:2508.04945v1 | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses ambiguous verb semantics in the evaluation of visual activity recognition, a concern for both researchers and practitioners. It is an incremental advance that builds on existing evaluation methods.

The paper addresses verb ambiguity in the evaluation of visual activity recognition systems and proposes a vision-language clustering framework that constructs verb sense clusters for more robust evaluation. On the imSitu dataset, each image maps to an average of 2.8 sense clusters, and the cluster-based evaluation aligns better with human judgements than standard methods.

Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance.
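
To make the contrast between exact-match and cluster-based scoring concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names, the toy predictions, and the hand-written clusters are all hypothetical, and it assumes a prediction counts as correct when it falls in any sense cluster attached to the image. The clustering step itself (the vision-language framework) is not reproduced here.

```python
from typing import List, Set


def exact_match_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions identical to the single gold verb."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not predictions:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(predictions)


def cluster_accuracy(
    predictions: List[str], image_clusters: List[List[Set[str]]]
) -> float:
    """Fraction of predictions that fall in any sense cluster of their image.

    image_clusters[i] is the list of verb sense clusters for image i; each
    cluster is a set of verbs treated as interchangeable (an assumption).
    """
    if len(predictions) != len(image_clusters):
        raise ValueError("predictions and image_clusters must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        any(pred in cluster for cluster in clusters)
        for pred, clusters in zip(predictions, image_clusters)
    )
    return hits / len(predictions)


# Hypothetical toy data built from the verb pairs mentioned in the abstract.
gold = ["brushing", "piloting", "cooking"]
predictions = ["grooming", "operating", "eating"]
image_clusters = [
    [{"brushing", "grooming"}],     # synonyms for the same event
    [{"piloting"}, {"operating"}],  # two perspectives on the same image
    [{"cooking"}],
]

exact = exact_match_accuracy(predictions, gold)
clustered = cluster_accuracy(predictions, image_clusters)
print(f"exact match: {exact:.2f}, cluster-based: {clustered:.2f}")
```

On this toy data exact match gives no credit for "grooming" or "operating", while the cluster-based score accepts both and still rejects "eating".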
