CV · AI · Oct 31, 2023

A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction

arXiv:2310.20225v2 · 19 citations · h-index: 8
Originality: Synthesis-oriented
AI Analysis

The paper addresses accessibility challenges for people with blindness and low vision, but its contribution is incremental: it combines existing models (RAM and InstructBLIP) with prompt engineering rather than introducing a new model.

The paper tackles scene recognition and hazard identification for people with blindness and low vision. It uses a large vision-language model to generate detailed environmental descriptions and warnings, and it reports accurate object recognition in experiments on indoor and outdoor datasets.

People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to the vision loss, pBLV have difficulty in accessing and identifying potential tripping hazards on their own. In this paper, we present a pioneering approach that leverages a large vision-language model to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environments and providing warnings about the potential risks. Our method begins by leveraging a large image tagging model (i.e., Recognize Anything (RAM)) to identify all common objects present in the captured images. The recognition results and user query are then integrated into a prompt, tailored specifically for pBLV using prompt engineering. By combining the prompt and input image, a large vision-language model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks in the environment by analyzing the environmental objects and scenes, relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method is able to recognize objects accurately and provide insightful descriptions and analysis of the environment for pBLV.
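
The three-stage flow described in the abstract (tag the image, fold the tags and the user's question into a prompt, then pass the prompt and image to a vision-language model) can be sketched roughly as follows. This is only an illustration under stated assumptions: the prompt wording, the function names `build_prompt` and `describe_scene`, and the `tagger` / `vlm` callables are hypothetical placeholders, not the authors' template or the actual RAM or InstructBLIP interfaces.

```python
from typing import Any, Callable, Sequence


def build_prompt(tags: Sequence[str], user_query: str) -> str:
    """Merge tagger output and the user's question into one text prompt.

    The wording below is a hypothetical template; the paper's actual
    prompt is not given in the abstract.
    """
    seen = set()
    unique = []
    for tag in tags:
        key = tag.strip().lower()
        # Skip empty strings and case-insensitive duplicates, keep order.
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    object_list = ", ".join(unique) if unique else "no objects detected"
    return (
        "The user is blind or has low vision. "
        f"Objects recognized in the image: {object_list}. "
        f"Question: {user_query.strip()} "
        "Describe the scene in detail and warn about potential hazards."
    )


def describe_scene(
    image: Any,
    user_query: str,
    tagger: Callable[[Any], Sequence[str]],
    vlm: Callable[[Any, str], str],
) -> str:
    """Run tagging, prompt construction, and generation in sequence.

    `tagger` stands in for an image tagging model (RAM in the paper) and
    `vlm` for a vision-language model (InstructBLIP in the paper); both
    are injected so the sketch does not assume any real library API.
    """
    tags = tagger(image)
    prompt = build_prompt(tags, user_query)
    return vlm(image, prompt)


if __name__ == "__main__":
    # Stub models for demonstration only; real models would replace these.
    fake_tagger = lambda img: ["stairs", "handrail", "Stairs"]
    fake_vlm = lambda img, prompt: f"[model answer for prompt: {prompt}]"
    print(describe_scene("photo.jpg", "Is it safe to walk ahead?", fake_tagger, fake_vlm))
```

Passing the models in as callables keeps the prompt-building step testable on its own, which is where the pBLV-specific tailoring described in the abstract would live.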

