FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression
This work addresses efficiency and performance issues in text-oriented vision-language models for high-resolution image understanding, representing an incremental improvement over existing methods.
The paper tackles the inefficiency of high-resolution visual tokens in Vision Large Language Models by proposing a visual token compression framework that combines a lightweight self-distillation pre-training stage with a high-quality post-training stage, reducing computational overhead and improving performance on text-oriented benchmarks.
The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods suffer serious performance degradation on tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a lightweight self-distillation pre-training stage to compress the visual tokens, requiring only a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of the token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.
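
To make the idea of compressing visual tokens under a self-distillation objective concrete, the following is a minimal illustrative sketch, not the method used in the paper. It rests on assumptions that do not come from the abstract: average pooling over consecutive tokens stands in for the learnable compression module, and mean squared error stands in for the distillation objective. The names `compress_tokens` and `distillation_loss` are hypothetical.

```python
from typing import List

Vector = List[float]


def compress_tokens(tokens: List[Vector], ratio: int) -> List[Vector]:
    """Merge every `ratio` consecutive tokens into one by averaging.

    Assumption: average pooling is a stand-in for a learnable
    compression module; the abstract does not specify the operator.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    compressed: List[Vector] = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]  # last group may be shorter
        dim = len(group[0])
        compressed.append(
            [sum(tok[d] for tok in group) / len(group) for d in range(dim)]
        )
    return compressed


def distillation_loss(student: Vector, teacher: Vector) -> float:
    """Mean squared error between student and teacher outputs.

    Assumption: MSE is a placeholder for whatever objective aligns the
    token-compressed (student) model with the uncompressed (teacher) one.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Four 2-dimensional visual tokens compressed by a ratio of 2.
visual_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(compress_tokens(visual_tokens, ratio=2))  # [[2.0, 3.0], [6.0, 7.0]]
```

In this sketch a compression ratio of 2 halves the number of visual tokens passed to the language model. In a self-distillation setup, the same model without compression would act as the teacher whose outputs the compressed student is trained to match.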