Andres Sevtsuk

CV
h-index: 2
3 papers
26 citations
Novelty: 42%
AI Score: 32

3 Papers

CV · Jun 28, 2022
Towards Global-Scale Crowd+AI Techniques to Map and Assess Sidewalks for People with Disabilities

Maryam Hosseini, Mikey Saugstad, Fabio Miranda et al. · MIT, UW

There is a lack of data on the location, condition, and accessibility of sidewalks across the world, which not only impacts where and how people travel but also fundamentally limits interactive mapping tools and urban analytics. In this paper, we describe initial work in semi-automatically building a sidewalk network topology from satellite imagery using hierarchical multi-scale attention models, inferring surface materials from street-level images using active learning-based semantic segmentation, and assessing sidewalk condition and accessibility features using Crowd+AI. We close with a call to create a database of labeled satellite and streetscape scenes for sidewalks and sidewalk accessibility issues along with standardized benchmarks.

CV · Sep 16, 2025
MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Liu Liu, Alexandra Kudaeva, Marco Cipriano et al.

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement, which are semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
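As an illustration of stage (3) only, here is a minimal sketch of one way pairwise affiliation decisions could be aggregated into group regions. It is an assumption, not the MINGLE authors' algorithm: it treats affiliated pairs as graph edges, takes connected components with union-find, and returns the enclosing box of each component with two or more people. The function name `group_boxes` and the `(x_min, y_min, x_max, y_max)` box format are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def group_boxes(person_boxes: List[Box],
                affiliated_pairs: List[Tuple[int, int]]) -> List[Box]:
    """Merge people linked by pairwise affiliation into group boxes.

    Illustrative only: connected components over the affiliation graph,
    then the tightest box enclosing each component's members.
    """
    parent = list(range(len(person_boxes)))

    def find(i: int) -> int:
        # Find the component root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair that was judged socially affiliated.
    for a, b in affiliated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect member indices per component root.
    members: Dict[int, List[int]] = {}
    for i in range(len(person_boxes)):
        members.setdefault(find(i), []).append(i)

    groups: List[Box] = []
    for idxs in members.values():
        if len(idxs) < 2:
            continue  # a lone person is not a group
        x0s, y0s, x1s, y1s = zip(*(person_boxes[i] for i in idxs))
        groups.append((min(x0s), min(y0s), max(x1s), max(y1s)))
    return groups
```

For example, with four detected people where persons 0 and 1 and persons 1 and 3 are classified as affiliated, the sketch yields a single group box covering persons 0, 1, and 3, and leaves person 2 ungrouped.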

CV · Jun 3, 2024
ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection

Maryam Hosseini, Marco Cipriano, Sedigheh Eslami et al.

Existing Open Vocabulary Detection (OVD) models exhibit a number of challenges. They often struggle with semantic consistency across diverse inputs and are sensitive to slight variations in input phrasing, leading to inconsistent performance. The calibration of their predictive confidence, especially in complex multi-label scenarios, remains suboptimal, frequently resulting in overconfident predictions that do not accurately reflect their context understanding. To understand these limitations, multi-label detection benchmarks are needed. A particularly challenging domain for such benchmarking is social activities. Due to the lack of multi-label benchmarks for social interactions, in this work we present ELSA: Evaluating Localization of Social Activities. ELSA draws on theoretical frameworks in urban sociology and design and uses in-the-wild street-level imagery, where the size of groups and the types of activities vary significantly. ELSA includes more than 900 manually annotated images with more than 4,300 multi-labeled bounding boxes for individual and group activities. We introduce a novel confidence score computation method, NLSE, and a novel Dynamic Box Aggregation (DBA) algorithm to assess semantic consistency in overlapping predictions. We report our results on the widely used SOTA models Grounding DINO, Detic, OWL, and MDETR. Our evaluation protocol considers semantic stability and localization accuracy and further exposes the limitations of existing approaches.