Zeyu Xu

CV
h-index: 2
Papers: 3
Citations: 15
Novelty: 48%
AI Score: 36

3 Papers

CV · Aug 3, 2025
E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation

Zeyu Xu, Junkang Zhang, Qiang Wang et al.

Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level. We then employ a lightweight VLM for frame scoring, further reducing computational costs at the model level. Additionally, we propose a frame retrieval strategy that leverages the global statistical distribution of inter-frame scores to mitigate the potential performance degradation from using a lightweight VLM. Finally, we introduce a multi-view question answering scheme for the retrieved frames, enhancing the VLM's capability to extract and comprehend information from long video contexts. Experiments on four public benchmarks show that E-VRAG achieves about 70% reduction in computational cost and higher accuracy compared to baseline methods, all without additional training. These results demonstrate the effectiveness of E-VRAG in improving both efficiency and accuracy for video RAG tasks.

CV · Aug 3, 2025
MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning

Yi Liu, Xiao Xu, Zeyu Xu et al.

Vision-Language Models (VLMs) have achieved remarkable breakthroughs in recent years, enabling a diverse array of applications in everyday life. However, the substantial computational and storage demands of VLMs pose significant challenges for their efficient deployment on mobile devices, which represent the most ubiquitous and accessible computing platforms today. In this work, we introduce MagicVL-2B, a novel VLM meticulously optimized for flagship smartphones. MagicVL-2B leverages a lightweight visual encoder with fewer than 100M parameters and features a redesigned dynamic resolution scheme that adaptively generates image tokens without excessive modification of image dimensions. To further enhance the performance of this compact encoder within VLMs, we propose a multimodal curriculum learning strategy that incrementally increases task difficulty and data information density throughout training. This approach substantially improves the model's performance across a variety of sub-tasks. Extensive evaluations on standard VLM benchmarks demonstrate that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%. These results establish MagicVL-2B as a practical and robust solution for real-world mobile vision-language applications, enabling advanced multimodal intelligence to run directly on smartphones.

MM · May 13, 2018
Video Processing on the Edge for Multimedia IoT Systems

Yang Cao, Zeyu Xu, Peng Qin et al.

In this article, we first survey the current state of video processing on the edge for multimedia Internet-of-Things (M-IoT) systems in three typical scenarios: smart cities, satellite networks, and Internet-of-Vehicles. By summarizing a general model of edge video processing, we highlight the importance of developing an edge computing platform. Then, we give a method of implementing cooperative video processing on an edge computing platform based on lightweight virtualization technologies. We conduct a performance evaluation and report the resulting observations. Finally, we summarize the challenges and opportunities of realizing effective edge video processing for M-IoT systems.