CV · CY · Sep 16, 2025

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

arXiv:2509.13484v2 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses urban planning needs by enabling detection of group-level social interactions from images, though it is incremental as it builds on existing components like VLMs and human detection.

The paper tackles the detection of social group interactions in urban scene images. It introduces a new task, spatially grounding regions defined by interpersonal relations, and proposes MINGLE, a three-stage pipeline that integrates human detection, VLM-based reasoning, and spatial aggregation. Its contributions include a new dataset of 100K annotated images and a method that handles semantically complex cues (relations, proximity, co-movement) that go beyond traditional object detection.

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement - semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
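As a rough illustration of stage (3), the sketch below merges pairwise affiliation decisions into group regions. It is not the authors' algorithm: the abstract does not specify how aggregation works. The sketch assumes that affiliation is treated as transitive (connected components via union-find) and that a group's region is the enclosing box of its members. The function name `group_regions`, the `(x_min, y_min, x_max, y_max)` box format, and the input pair list are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def group_regions(boxes: List[Box],
                  affiliated_pairs: List[Tuple[int, int]]) -> List[Box]:
    """Merge pairwise affiliations into group boxes (illustrative only).

    boxes: one box per detected person (stage 1 output, assumed format).
    affiliated_pairs: index pairs judged socially affiliated (stage 2 output).
    Returns one enclosing box per connected group of two or more people.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union the two people in every affiliated pair.
    for a, b in affiliated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each connected component.
    members: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        members.setdefault(find(i), []).append(i)

    regions: List[Box] = []
    for idxs in members.values():
        if len(idxs) < 2:
            continue  # a lone person is not a group
        x0, y0, x1, y1 = zip(*(boxes[i] for i in idxs))
        regions.append((min(x0), min(y0), max(x1), max(y1)))
    return regions


# Hypothetical input: persons 0, 1 and 3 are linked; person 2 stands alone.
people = [(0, 0, 10, 20), (8, 0, 18, 22), (100, 0, 110, 20), (16, 1, 25, 19)]
print(group_regions(people, [(0, 1), (1, 3)]))  # [(0, 0, 25, 22)]
```

Treating affiliation as transitive keeps the step lightweight, but a single false-positive pair can fuse two separate groups. A real system might also use depth or confidence scores before merging.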
