CV LG MA | Aug 25, 2025

Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance

Xiangxiang Wang, Xuanyu Wang, YiJia Luo, Yongbin Yu, Manping Fan, Jingtao Zhang, Liyong Ren

arXiv:2508.18177v1 | 1 citation | h-index: 4 | Expert Syst Appl

Originality: Incremental advance

AI Analysis

This research offers incremental improvements in computational efficiency for assistive technology, specifically benefiting visually impaired users with enhanced scene perception and navigation capabilities.

This study tackles real-time assistance for visually impaired users with a framework that reduces memory requirements from 38GB to 16GB while maintaining performance. The quantized model shows only a 2.05% performance drop on MMBench, and response latency is between 2.83 and 3.52 seconds.
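The memory figures can be sanity-checked with simple arithmetic. The sketch below estimates weight storage only (no activations, KV cache, or runtime overhead) for the 19B-parameter model mentioned in the abstract. The uniform bit widths and the 7B/12B mixed-precision split are hypothetical illustrations, not the paper's actual quantization scheme.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def mixed_memory_gb(parts):
    """Sum weight storage over (num_params, bits_per_weight) groups."""
    return sum(weight_memory_gb(n, b) for n, b in parts)


PARAMS = 19e9  # 19B parameters, as stated in the abstract

fp16_gb = weight_memory_gb(PARAMS, 16)  # uniform 16-bit weights
int8_gb = weight_memory_gb(PARAMS, 8)   # uniform 8-bit weights
int4_gb = weight_memory_gb(PARAMS, 4)   # uniform 4-bit weights

# Hypothetical split: 7B parameters kept at 8 bits, 12B quantized to 4 bits.
# This split is an assumption for illustration only.
mixed_gb = mixed_memory_gb([(7e9, 8), (12e9, 4)])

print(f"FP16: {fp16_gb:.1f} GB, INT8: {int8_gb:.1f} GB, "
      f"INT4: {int4_gb:.1f} GB, mixed: {mixed_gb:.1f} GB")
```

Under these assumptions, 16-bit weights alone account for 38 GB, which is consistent with the reported starting point, while the reported 16GB lies between the uniform 4-bit and 8-bit estimates, as one might expect when different components are quantized differently.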

This study proposes a dual technological innovation framework, comprising a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized memory multi-agent system for visually impaired assistance. The modular framework applies differentiated processing strategies, reducing memory requirements from 38GB to 16GB while maintaining model performance. The multi-agent architecture combines scene classification, vectorized memory, and multimodal interaction, enabling persistent storage and efficient retrieval of scene memories. Through perception-memory-reasoning workflows, the system uses historical memories to provide environmental information beyond the current view. Experiments show that the quantized 19B-parameter model experiences only a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory requirements, such as the Molmo-7B series. The system maintains a response latency of between 2.83 and 3.52 seconds from scene analysis to initial speech output, substantially faster than non-streaming methods. This research advances computational efficiency and assistive technology, offering visually impaired users comprehensive real-time assistance in scene perception, text recognition, and navigation.
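To make the idea of scene-aware vectorized memory concrete, here is a minimal sketch of storing scene descriptions as vectors and retrieving the closest one by cosine similarity, optionally filtered by scene label. The class names, the hand-written three-dimensional embeddings, and the example descriptions are all hypothetical; the paper's actual embedding model, storage backend, and retrieval logic are not described in the abstract.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneMemory:
    scene: str          # scene label, e.g. the output of a scene classifier
    description: str    # text description saved for later recall
    embedding: list     # vector for the description (toy values here)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class VectorMemoryStore:
    entries: list = field(default_factory=list)

    def add(self, scene, description, embedding):
        self.entries.append(SceneMemory(scene, description, embedding))

    def query(self, embedding, scene=None, top_k=1):
        """Return the top_k most similar memories, optionally within one scene."""
        candidates = [e for e in self.entries
                      if scene is None or e.scene == scene]
        ranked = sorted(candidates,
                        key=lambda e: cosine(embedding, e.embedding),
                        reverse=True)
        return ranked[:top_k]


store = VectorMemoryStore()
store.add("kitchen", "Kettle on the counter left of the sink", [0.9, 0.1, 0.0])
store.add("kitchen", "Fridge beside the door", [0.1, 0.9, 0.0])
store.add("street", "Crosswalk ahead with a tactile strip", [0.0, 0.1, 0.9])

# A query vector close to the kettle memory, restricted to the kitchen scene.
best = store.query([0.8, 0.2, 0.0], scene="kitchen")[0]
print(best.description)
```

Filtering by scene label before ranking is one simple way a scene classifier could narrow retrieval; a real system would likely use learned embeddings and an approximate nearest-neighbor index instead of a linear scan.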
