CV, Nov 19, 2023
GeoSAM: Fine-tuning SAM with Multi-Modal Prompts for Mobility Infrastructure Segmentation
Rafi Ibn Sultan, Chengyin Li, Hui Zhu et al.
In geographical image segmentation, performance is often constrained by the limited availability of training data and a lack of generalizability, particularly for segmenting mobility infrastructure such as roads, sidewalks, and crosswalks. Vision foundation models like the Segment Anything Model (SAM), pre-trained on millions of natural images, have demonstrated impressive zero-shot segmentation performance, providing a potential solution. However, SAM struggles with geographical images, such as aerial and satellite imagery, because its training was confined to natural images and because the narrow features and textures of mobility infrastructure blend into their surroundings. To address these challenges, we propose Geographical SAM (GeoSAM), a SAM-based framework that fine-tunes SAM using automatically generated multi-modal prompts. Specifically, GeoSAM integrates point prompts from a pre-trained task-specific model as primary visual guidance, and text prompts generated by a large language model as secondary semantic guidance, enabling the model to better capture both spatial structure and contextual meaning. GeoSAM outperforms existing approaches for mobility infrastructure segmentation in both familiar and completely unseen regions by at least 5% in mIoU, representing a significant leap in leveraging foundation models to segment both road and pedestrian infrastructure in geographical images. The source code is available at https://github.com/rafiibnsultan/GeoSAM.
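For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed for label maps. It is not taken from the GeoSAM repository; the function name `mean_iou`, the toy label maps, and the choice to skip classes absent from both maps are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are equally sized 2D lists of integer class labels.
    Skipping classes that occur in neither map is an illustrative choice.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                # Pixel counts once toward intersection and union of its class.
                inter[p] += 1
                union[p] += 1
            else:
                # Mismatch: the pixel enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x3 label maps (hypothetical): 0 = background, 1 = road, 2 = sidewalk.
target = [[0, 1, 1],
          [0, 2, 2]]
pred = [[0, 1, 0],
        [0, 2, 1]]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 2/3, 1/3, and 1/2, so the mean is 0.5. An improvement of "5% in mIoU" refers to a change in this averaged quantity.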
DS, Mar 31
Speeding-up Graph Algorithms via Clique Partitioning
Akshar Chavan, Sanaz Rabinia, Daniel Grosu et al.
Reducing the running time of graph algorithms is vital for tackling real-world problems such as shortest paths and matching in large-scale graphs, where path information plays a crucial role. To address this challenge, this paper introduces a graph restructuring algorithm that identifies bipartite cliques and replaces them with tripartite graphs. This restructuring leads to fewer edges while preserving complete graph path information, enabling the direct application of algorithms like matching and all-pairs shortest paths to achieve significant runtime reductions, especially for large, dense graphs. The running time of the proposed algorithm for a graph $G(V,E)$, with $|V| = n$ and $|E| = m$, is $O(mn^{\delta})$, which is better than $O(mn^{\delta}\log^2 n)$, the running time of the best existing algorithm for speeding up other graph algorithms (the Feder-Motwani (FM) algorithm), where $0 \leq \delta \leq 1$. Both the FM algorithm and the proposed algorithm are originally formulated for bipartite graphs, but can also be applied to general directed or undirected graphs. Our extensive experimental analysis demonstrates that the proposed algorithm achieves up to 21.26% higher reduction in the number of edges and runs up to 105.18× faster than the FM algorithm. On large synthetic graphs with up to 1.05 billion edges, it reduces the number of edges by up to 74.36%. On real-world graphs, it reduces the number of edges by up to 46.8%. Furthermore, when used as a preprocessing step, our approach yields up to a 2.07× speedup for the matching algorithms on large synthetic graphs, and up to a 1.74× speedup for the all-pairs shortest path algorithms on real-world graphs, when compared to using the given graph as input.
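The edge saving from replacing a bipartite clique can be illustrated with a small sketch. It shows only the replacement step for one biclique that is already known, under the assumption that the added middle layer is a single new vertex. It does not show how the paper or the FM algorithm finds bicliques. The names `replace_biclique`, `neighbors_via_hub`, and the vertex labels are hypothetical.

```python
def replace_biclique(edges, left, right, hub):
    """Replace the complete bipartite block left x right by a hub vertex.

    edges is a set of (u, v) pairs; hub is assumed to be a fresh vertex
    that does not already occur in the graph.
    """
    block = {(u, v) for u in left for v in right}
    if not block <= edges:
        raise ValueError("left x right is not a biclique of the graph")
    compressed = edges - block                      # drop |left|*|right| edges
    compressed |= {(u, hub) for u in left}          # add |left| edges
    compressed |= {(hub, v) for v in right}         # add |right| edges
    return compressed


def neighbors_via_hub(compressed, u, hubs):
    """Recover u's original neighbours, expanding any hub it points to."""
    out = set()
    for a, b in compressed:
        if a != u:
            continue
        if b in hubs:
            out |= {v for h, v in compressed if h == b}
        else:
            out.add(b)
    return out


# Hypothetical example: a 3 x 4 biclique plus one extra edge.
left = {"a1", "a2", "a3"}
right = {"b1", "b2", "b3", "b4"}
edges = {(u, v) for u in left for v in right} | {("a4", "b1")}
compressed = replace_biclique(edges, left, right, hub="h")
```

In this example the graph shrinks from 13 edges to 8, because the 12 biclique edges become 3 + 4 edges through the hub. Each original adjacency becomes a two-hop connection through the hub, so an algorithm run on the compressed graph must account for the added vertex.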
CV, Mar 11
WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou et al.
Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website: https://sites.google.com/view/walkgpt-26/home.
LG, Jan 10, 2021
Adversarially Robust and Explainable Model Compression with On-Device Personalization for Text Classification
Yao Qiang, Supriya Tumkur Suresh Kumar, Marco Brocanelli et al.
On-device Deep Neural Networks (DNNs) have recently gained more attention due to the increasing computing power of mobile devices and the growing number of applications in Computer Vision (CV), Natural Language Processing (NLP), and the Internet of Things (IoT). Unfortunately, the existing efficient convolutional neural network (CNN) architectures designed for CV tasks are not directly applicable to NLP tasks, and the tiny Recurrent Neural Network (RNN) architectures have been designed primarily for IoT applications. In NLP applications, although model compression has seen initial success in on-device text classification, there are at least three major challenges yet to be addressed: adversarial robustness, explainability, and personalization. Here we attempt to tackle these challenges by designing a new training scheme for model compression and adversarial robustness, including the optimization of an explainable feature mapping objective, a knowledge distillation objective, and an adversarial robustness objective. The resulting compressed model is personalized using on-device private training data via fine-tuning. We perform extensive experiments to compare our approach with both compact RNN (e.g., FastGRNN) and compressed RNN (e.g., PRADO) architectures in both natural and adversarial NLP test settings.