CV · DB · May 26

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han

arXiv:2605.26601 · 79.9

Predicted impact top 28% in CV · last 90 days

Originality Incremental advance

AI Analysis

This work addresses the lack of reproducible training and evaluation infrastructure for Tibetan, a severely underserved low-resource language, in vision-language research.

FTibSuite is the first standardized resource suite for Tibetan vision-language modeling, comprising datasets, benchmarks, and a baseline model. The baseline model, FTibVLM, achieves large gains on the Tibetan benchmarks (e.g., MMBench accuracy from 42.97 to 67.78) while largely preserving the backbone's original Chinese capabilities.

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.
