CV, Apr 28, 2023

Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment

arXiv:2304.14623v2, 9 citations, h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses safety-critical image captioning for people with vision impairments. It is an incremental advance: it builds on existing captioning models and enhances them rather than introducing a new one.

The paper tackles noisy images, which lead to incorrect and unsafe captions for visually impaired users, by proposing a quality-agnostic framework. It reports an absolute improvement of 2.15 on CIDEr over state-of-the-art image captioning networks, and up to 3 points on CIDEr in noisier settings.

Automated image captioning has the potential to be a useful tool for people with vision impairments. Images taken by this user group are often noisy, which leads to incorrect and even unsafe model predictions. In this paper, we propose a quality-agnostic framework to improve the performance and robustness of image captioning models for visually impaired people. We address this problem from three angles: data, model, and evaluation. First, we show how data augmentation techniques for generating synthetic noise can address data sparsity in this domain. Second, we enhance the robustness of the model by expanding a state-of-the-art model to a dual network architecture, using the augmented data and leveraging different consistency losses. Our results demonstrate increased performance, e.g. an absolute improvement of 2.15 on CIDEr, compared to state-of-the-art image captioning networks, as well as increased robustness to noise with up to 3 points improvement on CIDEr in more noisy settings. Finally, we evaluate the prediction reliability using confidence calibration on images with different difficulty/noise levels, showing that our models perform more reliably in safety-critical situations. The improved model is part of an assisted living application, which we develop in partnership with the Royal National Institute of Blind People.
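The abstract does not say which calibration measure the authors use. As an illustration only, the sketch below computes expected calibration error (ECE), a common way to check whether a model's confidence matches how often it is right. The function name, the equal-width binning, and the idea of reducing each caption to a single correct/incorrect flag are assumptions made for this example, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: model confidence per prediction, each in [0, 1].
    correct:     booleans, True where the prediction was judged correct.
                 (How a caption is judged "correct" is an assumption here,
                 e.g. a thresholded caption metric.)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    # Sum the |mean confidence - accuracy| gap per bin, weighted by bin size.
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece
```

A lower value means confidence tracks accuracy more closely. Computing it separately for each noise or difficulty level would give one way to compare reliability across image quality, in the spirit of the evaluation described above.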

