CV · HC · Feb 13

GLIMPSE: Real-Time Text Recognition and Contextual Understanding for VQA in Wearables

arXiv:2602.13479v1
h-index: 11
Originality: Highly original
AI Analysis

The system enables sustained VQA sessions on resource-constrained wearables, addressing a practical problem for both wearable device users and developers.

The paper tackles the deployment of text-based visual question answering (Text VQA) on wearable devices, where text recognition needs high-resolution video but streaming that video consumes too much power. Its system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming.

Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements: OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.
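The hybrid idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not say how frames are selected for OCR, so the frame-difference trigger below is an assumption, and every name (`downsample`, `mean_abs_diff`, `process_stream`, `Packet`, the `ocr` callback, `factor`, `change_threshold`) is hypothetical. Frames are modeled as plain lists of grayscale pixel rows so the example runs without any imaging library.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A grayscale frame as rows of pixel intensities (assumption for illustration).
Frame = List[List[int]]


def downsample(frame: Frame, factor: int) -> Frame:
    """Naive decimation: keep every `factor`-th pixel in each dimension."""
    return [row[::factor] for row in frame[::factor]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two same-sized frames."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


@dataclass
class Packet:
    low_res: Frame          # always streamed, for coarse visual context
    text: Optional[str]     # on-device OCR result, or None when OCR was skipped


def process_stream(
    frames: List[Frame],
    ocr: Callable[[Frame], str],
    factor: int = 4,
    change_threshold: float = 10.0,
) -> List[Packet]:
    """Stream low-resolution frames; run full-resolution OCR only selectively.

    Assumed selection rule (not from the paper): run OCR on the first frame
    and whenever the low-resolution view differs enough from the view at the
    time of the last OCR pass.
    """
    packets: List[Packet] = []
    last_ocr_view: Optional[Frame] = None
    for frame in frames:
        small = downsample(frame, factor)
        text = None
        if last_ocr_view is None or mean_abs_diff(small, last_ocr_view) > change_threshold:
            text = ocr(frame)  # the expensive step sees the full-resolution frame
            last_ocr_view = small
        packets.append(Packet(low_res=small, text=text))
    return packets
```

In this sketch the expensive OCR call runs only on frames whose content has changed, while every frame still contributes a small low-resolution view. That mirrors the asymmetry the abstract describes, under the stated assumptions about how frames are selected.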

