LG CL | Aug 22, 2024

Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T. M. Nguyen, Bang Q. Vo, Suong N. Hoang

arXiv:2408.12480v2 | 17.015 citations | h-index: 6 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for efficient on-device multimodal AI applications in Vietnamese, though it is incremental: it combines existing models for a specific language context.

The researchers introduce Vintern-1B, a 1-billion-parameter multimodal large language model for Vietnamese that integrates Qwen2-0.5B-Instruct with InternViT-300M-448px. After fine-tuning on over 3 million image-question-answer pairs, the model achieves robust performance on benchmarks such as OpenViVQA and ViTextVQA.

In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs and achieves robust, reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
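To make the on-device claim more concrete, the following is a rough back-of-the-envelope sketch of how much memory the weights alone might occupy at different numeric precisions. The component sizes are nominal values read off the model names (0.5B and 300M), not exact counts from the report; the precision list and the helper names `weight_memory_gib` and `footprint_table` are illustrative assumptions. Activations, the KV cache, and any connector layers are ignored.

```python
# Rough weight-memory estimate. All sizes are nominal (taken from the model
# names), not measured parameter counts from the report.
NOMINAL_PARAMS = {
    "Qwen2-0.5B-Instruct": 0.5e9,    # assumption: "0.5B" in the name
    "InternViT-300M-448px": 0.3e9,   # assumption: "300M" in the name
}

# Bytes needed to store one parameter at common precisions (illustrative list).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory required to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


def footprint_table(num_params: float) -> dict:
    """Weight memory in GiB (rounded to 2 decimals) for each precision."""
    return {
        precision: round(weight_memory_gib(num_params, nbytes), 2)
        for precision, nbytes in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    nominal_total = sum(NOMINAL_PARAMS.values())
    print(f"Nominal component total: {nominal_total / 1e9:.1f}B parameters")
    # The report describes the model as 1-billion-parameter, so use 1e9
    # as the headline budget.
    for precision, gib in footprint_table(1e9).items():
        print(f"{precision:>10}: {gib} GiB")
```

Under these assumptions, a 1-billion-parameter model needs a little under 2 GiB for its weights at 16-bit precision and under 1 GiB at 8-bit, which is consistent with the abstract's statement that the model fits into on-device applications. Actual footprints depend on the runtime and on the exact parameter count.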
