CV · Apr 13, 2025

TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang

arXiv:2504.09641v135.259 citations · h-index: 5 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient video reasoning models among researchers with limited computational resources. The advance is incremental, since it builds on existing small-scale models and reinforcement learning techniques.

The paper tackles the limited reasoning capabilities of small-scale multimodal models on video question answering. It shows that reinforcement learning on general Video-QA datasets significantly improves reasoning and thinking abilities, and that the trained model exhibits emergent "aha moments".

Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.
