Giles Hamilton-Fletcher

CV
h-index12
6papers
8citations
Novelty33%
AI Score45

6 Papers

CVJan 26Code
Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher et al.

This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.

CVFeb 2Code
Evaluating OCR Performance for Assistive Technology: Effects of Walking Speed, Camera Placement, and Camera Type

Junchi Feng, Nikhil Ballem, Mahya Beheshti et al.

Optical character recognition (OCR), which converts printed or handwritten text into machine-readable form, is widely used in assistive technology for people with blindness and low vision. Yet, most evaluations rely on static datasets that do not reflect the challenges of mobile use. In this study, we systematically evaluated OCR performance under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and viewing angles of 0-75 degrees horizontally. Dynamic tests examined the impact of motion by varying walking speed from slow (0.8 m/s) to very fast (1.8 m/s) and comparing three camera mounting positions: head-mounted, shoulder-mounted, and hand-held. We evaluated both a smartphone and smart glasses, using the phone's main and ultra-wide cameras. Four OCR engines were benchmarked to assess accuracy at different distances and viewing angles: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract. PaddleOCR 3.0 was then used to evaluate accuracy at different walking speeds. Accuracy was computed at the character level using the Levenshtein ratio against manually defined ground truth. Results showed that recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved the highest overall accuracy, with PaddleOCR close behind as the strongest open-source alternative. Across devices, the phone's main camera achieved the highest accuracy, and a shoulder-mounted placement yielded the highest average among body positions; however, differences among shoulder, head, and hand were not statistically significant.

CVApr 13
Scene Change Detection with Vision-Language Representation Learning

Diwei Sheng, Vijayraj Gohil, Satyam Gaba et al.

Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.

CVMar 6, 2025
Robust Computer-Vision based Construction Site Detection for Assistive-Technology Applications

Junchi Feng, Giles Hamilton-Fletcher, Nikhil Ballem et al.

Purpose: Navigating urban environments poses significant challenges for individuals who are blind or have low vision, especially in areas affected by construction. Construction zones introduce hazards such as uneven surfaces, barriers, hazardous materials, excessive noise, and altered routes that obstruct familiar paths and compromise safety. Although navigation tools assist in trip planning, they often overlook these temporary obstacles. Existing hazard detection systems also struggle with the visual variability of construction sites. Methods: We developed a computer vision--based assistive system integrating three modules: an open-vocabulary object detector to identify diverse construction-related elements, a YOLO-based model specialized in detecting scaffolding and poles, and an optical character recognition module to interpret construction signage. Results: In static testing at seven construction sites using images from multiple stationary viewpoints, the system achieved 88.56% overall accuracy. It consistently identified relevant objects within 2--10 meters and at approach angles up to 75$^{\circ}$. At 2--4 meters, detection was perfect (100%) across all angles. Even at 10 meters, six of seven sites remained detectable within a 15$^{\circ}$ approach. In dynamic testing along a 0.5-mile urban route containing eight construction sites, the system analyzed every frame of a first-person walking video. It achieved 87.26% accuracy in distinguishing construction from non-construction areas, rising to 92.0% with a 50-frame majority vote filter. Conclusion: The system can reliably detect construction sites in real time and at sufficient distances to provide advance warnings, enabling individuals with visual impairments to make safer mobility decisions such as proceeding with caution or rerouting.

HCSep 22, 2025
Does Embodiment Matter to Biomechanics and Function? A Comparative Analysis of Head-Mounted and Hand-Held Assistive Devices for Individuals with Blindness and Low Vision

Gaurav Seth, Hoa Pham, Giles Hamilton-Fletcher et al.

Visual assistive technologies, such as Microsoft Seeing AI, can improve access to environmental information for persons with blindness or low vision (pBLV). Yet, the physical and functional implications of different device embodiments remain unclear. In this study, 11 pBLV participants used Seeing AI on a hand-held smartphone and on a head-mounted ARx Vision system to perform six activities of daily living, while their movements were captured with Xsens motion capture. Functional outcomes included task time, success rate, and number of attempts, and biomechanical measures included joint range of motion, angular path length, working volume, and movement smoothness. The head-mounted system generally reduced upper-body movement and task time, especially for document-scanning style tasks, whereas the hand-held system yielded higher success rates for tasks involving small or curved text. These findings indicate that both embodiments are viable, but they differ in terms of physical demands and ease of use. Incorporating biomechanical measures into assistive technology evaluations can inform designs that optimise user experience by balancing functional efficiency, physical sustainability, and intuitive interaction.

CVJun 11, 2024
Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation

Diwei Sheng, Giles Hamilton-Fletcher, Mahya Beheshti et al.

Curbs serve as vital borders that delineate safe pedestrian zones from potential vehicular traffic hazards. Curbs also represent a primary spatial hazard during dynamic navigation with significant stumbling potential. Such vulnerabilities are particularly exacerbated for persons with blindness and low vision (PBLV). Accurate visual-based discrimination of curbs is paramount for assistive technologies that aid PBLV with safe navigation in urban environments. Herein, we investigate the efficacy of curb segmentation for foundation models. We introduce the largest curb segmentation dataset to-date to benchmark leading foundation models. Our results show that state-of-the-art foundation models face significant challenges in curb segmentation. This is due to their high false-positive rates (up to 95%) with poor performance distinguishing curbs from curb-like objects or non-curb areas, such as sidewalks. In addition, the best-performing model averaged a 3.70-second inference time, underscoring problems in providing real-time assistance. In response, we propose solutions including filtered bounding box selections to achieve more accurate curb segmentation. Overall, despite the immediate flexibility of foundation models, their application for practical assistive technology applications still requires refinement. This research highlights the critical need for specialized datasets and tailored model training to address navigation challenges for PBLV and underscores implicit weaknesses in foundation models.