CV · Dec 16, 2024

OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

arXiv:2412.11475v210.510 citations · h-index: 8 · Has Code

Originality: Incremental advance

AI Analysis

This enables efficient deployment of vision-language models on edge devices. The advance is incremental: it builds on existing methods, and its main contribution is visual token compression and faster inference.

The paper tackles efficient on-device vision-language inference by introducing OmniVLM, a sub-billion-parameter model that reduces the visual token count from 729 to 81. On the same laptop it achieves 9.1x faster time-to-first-token and 1.5x higher decoding speed than nanoLLAVA, while matching the performance of larger models on benchmarks such as ScienceQA.
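The headline ratios can be checked directly against the raw figures reported in the abstract. The snippet below is only an arithmetic sanity check; the variable names are illustrative and do not come from the paper.

```python
# Figures as reported in the abstract (both models measured on the same laptop).
ttft_omnivlm_s = 0.75         # time-to-first-token, OmniVLM (seconds)
ttft_nanollava_s = 6.82       # time-to-first-token, nanoLLAVA (seconds)
decode_omnivlm_tps = 29.41    # decoding speed, OmniVLM (tokens/s)
decode_nanollava_tps = 19.20  # decoding speed, nanoLLAVA (tokens/s)
tokens_before, tokens_after = 729, 81

# Lower TTFT is better, so the speedup is baseline / OmniVLM.
ttft_speedup = ttft_nanollava_s / ttft_omnivlm_s
# Higher decoding speed is better, so the speedup is OmniVLM / baseline.
decode_speedup = decode_omnivlm_tps / decode_nanollava_tps
# Reduction factor in visual token sequence length.
compression = tokens_before / tokens_after

print(f"TTFT speedup:   {ttft_speedup:.1f}x")
print(f"Decode speedup: {decode_speedup:.1f}x")
print(f"Token compression: {compression:.0f}x")
```

The 9x reduction in visual tokens and the roughly 9.1x time-to-first-token speedup are close, which is consistent with prefill cost being dominated by the visual token sequence, although the abstract itself does not make that claim.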

We present OmniVLM, a sub-billion-parameter vision-language model for efficient on-device inference. OmniVLM introduces a token compression mechanism that reduces visual token sequence length from 729 to 81 tokens, significantly reducing computational overhead while preserving visual-semantic fidelity. Through a multi-stage training pipeline of pretraining, supervised fine-tuning, and minimal-edit Direct Preference Optimization (DPO), OmniVLM matches the performance of larger models. On multiple benchmarks including ScienceQA, POPE, and MMMU, OmniVLM outperforms existing baselines like nanoLLAVA within a 968M-parameter footprint. Empirical results on the same laptop demonstrate 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) compared to nanoLLAVA, enabling efficient deployment on edge devices. The model weights can be accessed on Hugging Face: https://huggingface.co/NexaAIDev/OmniVLM-968M, and the inference examples can be found in Appendix B.
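The abstract does not say how the 729 visual tokens become 81. As a rough illustration only, one way to get a 9x shorter sequence without discarding values is to concatenate every 9 consecutive token embeddings into one wider vector. The sketch below assumes that grouping scheme; the function name `compress_tokens`, the group size parameter, and the toy hidden size are all hypothetical and not taken from the paper.

```python
from typing import List


def compress_tokens(tokens: List[List[float]], group: int = 9) -> List[List[float]]:
    """Merge every `group` consecutive token vectors into one wider vector.

    Assumed scheme for illustration: the sequence gets `group` times shorter
    and each vector gets `group` times wider, so no values are dropped.
    """
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    compressed = []
    for start in range(0, len(tokens), group):
        merged: List[float] = []
        for tok in tokens[start:start + group]:
            merged.extend(tok)  # concatenate along the feature dimension
        compressed.append(merged)
    return compressed


# Toy input: 729 visual tokens with a tiny hidden size of 4.
hidden = 4
visual_tokens = [[float(i)] * hidden for i in range(729)]

compressed = compress_tokens(visual_tokens)
print(len(compressed), len(compressed[0]))  # sequence length, vector width
```

Under this assumption, a projection layer would typically follow to map the widened vectors to the language model's embedding size; that step is omitted here.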
