CV · DB · May 26

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

arXiv:2605.26601 · 79.9
Predicted impact: top 28% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the lack of reproducible training and evaluation infrastructure for Tibetan, a severely underserved low-resource language, in vision-language research.

FTibSuite provides the first standardized resource suite for Tibetan vision-language modeling, comprising training datasets, benchmarks, and a baseline model. The baseline, FTibVLM, improves on its backbone across all benchmark tasks (e.g., MMBench accuracy rises from 42.97 to 67.78) while retaining the backbone's original Chinese capabilities with minimal degradation.

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.
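To put the two reported accuracy changes on a common footing, the following minimal sketch computes the absolute gain (in percentage points) and the relative gain for each benchmark. The only inputs are the four figures quoted in the abstract; the names `reported` and `gains` are illustrative and do not come from the paper or its code.

```python
# Accuracy figures quoted in the abstract (percent): (backbone, FTibVLM).
reported = {
    "MMBench": (42.97, 67.78),
    "POPE-random": (47.53, 80.56),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


# Map each benchmark name to its (absolute, relative) gain.
summary = {name: gains(b, a) for name, (b, a) in reported.items()}

for name, (abs_gain, rel_gain) in summary.items():
    print(f"{name}: +{abs_gain:.2f} points ({rel_gain:.1f}% relative)")
```

On these figures the gains are +24.81 points on MMBench and +33.03 points on POPE-random. The abstract reports no variance or sample sizes, so this arithmetic says nothing about statistical significance.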

