HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
This work targets weak human action understanding in MLLMs and is aimed at researchers and practitioners in video AI. Its contribution is incremental in that it centers on data curation rather than a novel model architecture.
The paper traces the limited human action understanding of Multi-modal Large Language Models (MLLMs) to low-quality data and introduces a two-stage annotation pipeline for curating high-quality video-caption datasets. Training on the curated data yields significant improvements in human action understanding across 4 benchmarks and also improves text-to-video generation.
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 412 manually annotated video-caption pairs and 2,000 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human action understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.
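To make the standardized caption format more concrete, the following is a minimal sketch of how a caption that identifies individuals by their attributes and lists their actions chronologically could be represented. The class names, field names, timestamps, and example sentences are all hypothetical illustrations; they are not the schema of the released HAICTrain or HAICBench files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    # Hypothetical field: visible attributes used to tell individuals apart.
    attributes: str


@dataclass
class ActionEvent:
    start_s: float     # hypothetical timestamp, used here only for ordering
    actor: int         # index into StructuredCaption.people
    description: str   # what the actor does, including any interaction


@dataclass
class StructuredCaption:
    people: List[Person]
    events: List[ActionEvent] = field(default_factory=list)

    def render(self) -> str:
        """Emit one sentence per event, sorted chronologically."""
        ordered = sorted(self.events, key=lambda e: e.start_s)
        sentences = []
        for event in ordered:
            subject = self.people[event.actor].attributes
            # Capitalize the first letter of the attribute phrase.
            sentences.append(f"{subject[0].upper()}{subject[1:]} {event.description}.")
        return " ".join(sentences)


# Invented example: events are deliberately given out of order.
caption = StructuredCaption(
    people=[Person("a woman in a red jacket"), Person("a man with a backpack")],
    events=[
        ActionEvent(4.0, 1, "takes the bottle from her and drinks"),
        ActionEvent(0.0, 0, "picks up a water bottle"),
        ActionEvent(2.5, 0, "hands the bottle to the man with a backpack"),
    ],
)
print(caption.render())
```

In this sketch each person is always referred to by the same attribute phrase, so a reader (or a model) can follow who does what even when several people appear, and sorting by start time keeps the narrative in chronological order regardless of annotation order.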