AS · AI · CL · CV · IV · Nov 30, 2023

Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices

arXiv:2312.00174v1 · 1 citation · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses accessibility for people with visual impairments by enabling efficient image-to-speech systems on mobile devices, though it is incremental as it focuses on compression of existing methods.

The paper tackled the challenge of deploying image-to-speech systems on low-resource devices by compressing an end-to-end neural architecture, reducing model parameters from 6.1 million to 2.46 million with minimal performance drop and a 22% speedup in inference time.

People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices such as mobile phones and laptops. Image-to-speech (ITS) systems can help mitigate this problem, but their large model size makes them extremely hard to deploy on low-resourced embedded devices. In this paper, we aim to overcome this challenge by developing an efficient end-to-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduce a vision transformer-based image encoder and use knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach leads to only a minimal drop in performance and can speed up inference by 22%.
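
To put the reported parameter counts in perspective, the short sketch below computes the compression ratio implied by going from 6.1 million to 2.46 million parameters. It also includes a generic output-matching distillation loss for regression-style targets (for example, spectrogram frames). The loss form, the function names, and the `alpha` weighting are illustrative assumptions; the abstract does not specify the distillation objective the authors actually used.

```python
def compression_summary(teacher_params, student_params):
    """Return (compression ratio, fraction of parameters removed)."""
    ratio = teacher_params / student_params
    removed = 1.0 - student_params / teacher_params
    return ratio, removed


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Hypothetical distillation objective (an assumption, not the paper's loss).

    Weighted sum of two mean-squared errors over flat lists of floats:
    student vs. teacher output, and student vs. ground-truth target.
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    return alpha * mse(student_out, teacher_out) + (1.0 - alpha) * mse(student_out, target)


# Parameter counts as reported in the abstract.
ratio, removed = compression_summary(6.1e6, 2.46e6)
print(f"{ratio:.2f}x smaller, {removed:.1%} of parameters removed")
# prints "2.48x smaller, 59.7% of parameters removed"
```

In this sketch, `alpha` trades off imitation of the larger teacher model against fitting the ground truth directly; a value of 1.0 would train the student on teacher outputs alone.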

