MultiClimate: Multimodal Stance Detection on Climate Change Videos
This work addresses the challenge of understanding public opinion on climate change through multimodal data, but its contribution is incremental: it pairs existing stance detection methods with a new dataset.
The paper tackles stance detection on climate change in videos by introducing MultiClimate, a new dataset of 100 YouTube videos and 4,209 frame-transcript pairs, and shows that combining text and image modalities achieves state-of-the-art accuracy/F1 scores of 0.747/0.749.
Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source, manually annotated stance detection dataset with $100$ CC-related YouTube videos and $4{,}209$ frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models, for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art performance, $0.747$/$0.749$ in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models. Our code, dataset, and supplementary materials are available at https://github.com/werywjw/MultiClimate.
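To make the two reported metrics concrete, the following is a minimal sketch of how accuracy and F1 could be computed for a multi-class stance task. It is not the authors' evaluation code. The abstract does not say how F1 is averaged across classes, so macro-averaging is an assumption here, and the label names `support`, `neutral`, and `oppose`, as well as the helper names `accuracy` and `macro_f1`, are hypothetical placeholders.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was the true class but was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from MultiClimate.
gold = ["support", "support", "neutral", "oppose"]
pred = ["support", "neutral", "neutral", "oppose"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

On imbalanced label distributions, macro-averaged and weighted F1 can differ noticeably, so the averaging scheme should be checked against the paper or its repository before comparing numbers.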