LG · Aug 23, 2024

LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

arXiv:2408.13402v2 · 4 citations · h-index: 3 · Has Code
AI Analysis

This work addresses the need for efficient AI models that can run on small compute footprints to democratize access, though it appears incremental as it builds on existing MM-LLM advancements with a ternary approach.

The paper tackles the problem of making multimodal large language models (MM-LLMs) more accessible by developing LLaVaOLMoBitnet1B, the first ternary MM-LLM that accepts image and text inputs to produce coherent responses, and it is fully open-sourced to encourage further research.

Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run efficiently on small compute footprints accessible to most. As part of this quest, we introduce LLaVaOLMoBitnet1B, the first Ternary Multimodal LLM capable of accepting Image(s)+Text inputs to produce coherent textual responses. The model is fully open-sourced along with training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models, and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
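
To make "ternary" concrete: in a ternary model each weight is restricted to one of three values, -1, 0, or +1, usually with a shared floating-point scale. The sketch below is only an illustration and is not taken from this report. It assumes the absmean rounding scheme popularized by BitNet b1.58 (divide by the mean absolute weight, round, clip to [-1, 1]); the function names `ternarize` and `dequantize` are hypothetical, and the report's actual training recipe may differ.

```python
def ternarize(weights, eps=1e-8):
    """Quantize a list of floats to values in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling in the style of BitNet b1.58, used here
    purely for illustration.
    """
    # Shared scale: the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Normalize by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [t * scale for t in ternary]


weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]
ternary, scale = ternarize(weights)
print(ternary)  # [1, 0, 1, -1, 0, 1]
print(dequantize(ternary, scale))
```

Each ternary weight carries at most log2(3) ≈ 1.58 bits of information, which is where the small memory and compute footprint described in the abstract comes from.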
