CV · AI · Feb 14, 2025

Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

arXiv:2502.09927v1 · 23 citations · h-index: 40 · Has Code
Originality: Incremental advance
AI Analysis

Granite Vision provides a practical, open-source solution for enterprise intelligence tasks, though the advance is incremental because it builds on existing lightweight and multimodal approaches.

The authors tackle visual document understanding in enterprise settings by introducing Granite Vision, a lightweight multimodal model that achieves strong results on standard document-understanding benchmarks and on LiveXiv, a benchmark designed to avoid test set contamination.

We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach at test time that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache 2.0 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
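
To make the idea of classifying inputs from a sparse set of attention vectors more concrete, here is a minimal toy sketch. It is not the authors' method: the abstract only states that a sparse set of attention vectors is used at test time. The head-scoring rule (distance between class means), the nearest-class-mean vote, the synthetic data, and the names `select_sparse_heads` and `classify` are all assumptions made for illustration.

```python
import numpy as np


def select_sparse_heads(features, labels, k):
    """Pick the k heads whose vectors best separate the two classes.

    features: array of shape (n_samples, n_heads, dim), one vector per head.
    labels:   array of shape (n_samples,), 0 = safe, 1 = harmful.
    Scoring by distance between class means is an assumption of this sketch.
    """
    safe_mean = features[labels == 0].mean(axis=0)      # (n_heads, dim)
    harmful_mean = features[labels == 1].mean(axis=0)   # (n_heads, dim)
    scores = np.linalg.norm(harmful_mean - safe_mean, axis=1)
    heads = np.argsort(scores)[::-1][:k]                # top-k head indices
    return heads, safe_mean[heads], harmful_mean[heads]


def classify(sample, heads, safe_mean, harmful_mean):
    """Return 1 (harmful) if a strict majority of selected heads vote harmful.

    sample: array of shape (n_heads, dim). Each selected head votes for the
    class whose mean vector is nearer. Ties count as safe.
    """
    x = sample[heads]
    d_safe = np.linalg.norm(x - safe_mean, axis=1)
    d_harm = np.linalg.norm(x - harmful_mean, axis=1)
    votes = d_harm < d_safe
    return int(votes.sum() * 2 > len(heads))


# Synthetic stand-in for per-head attention vectors (hypothetical data).
rng = np.random.default_rng(0)
n_heads, dim = 16, 8
features = rng.normal(size=(40, n_heads, dim))
labels = np.array([0] * 20 + [1] * 20)
for h in (3, 7):                        # only heads 3 and 7 carry signal
    features[labels == 1, h] += 3.0

heads, safe_mean, harmful_mean = select_sparse_heads(features, labels, k=2)

harmful_sample = rng.normal(size=(n_heads, dim))
harmful_sample[[3, 7]] += 3.0
safe_sample = rng.normal(size=(n_heads, dim))

harmful_pred = classify(harmful_sample, heads, safe_mean, harmful_mean)
safe_pred = classify(safe_sample, heads, safe_mean, harmful_mean)
```

In this toy setup only two of the sixteen heads are informative, so the classifier reads two vectors per input instead of all sixteen. In a real model the vectors would come from the network's attention heads rather than from a random generator.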

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
