XMeCap: Meme Caption Generation with Sub-Image Adaptability
This work addresses multi-modal humor understanding, a problem of interest to AI researchers, though it appears incremental: it builds on existing captioning methods with adaptations specific to memes.
The paper tackles the challenge of generating captions for memes, with a particular focus on multi-image memes, and introduces the XMeCap framework. XMeCap achieves average evaluation scores of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 6.75% and 8.56%, respectively.
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. We then introduce the \textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, show a marked improvement in caption generation for both single-image and multi-image memes, as well as across different meme categories. \textsc{XMeCap} achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 6.75\% and 8.56\%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.
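To make the idea of a reward that "factors in both global and local similarities between visuals and text" more concrete, the following is a minimal sketch, not the reward model used by XMeCap. It assumes that the whole meme, each sub-image, the full caption, and each caption phrase have already been embedded into a shared vector space by some encoder. The cosine similarity, the max-over-phrases matching, the mixing weight alpha, and all function names are illustrative assumptions that do not come from the abstract.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def global_similarity(image_vec: Vector, caption_vec: Vector) -> float:
    """Assumed global term: whole-meme embedding vs. whole-caption embedding."""
    return cosine(image_vec, caption_vec)


def local_similarity(sub_image_vecs: Sequence[Vector],
                     phrase_vecs: Sequence[Vector]) -> float:
    """Assumed local term: match each sub-image to its most similar caption
    phrase, then average those best-match scores over the sub-images."""
    if not sub_image_vecs or not phrase_vecs:
        return 0.0
    best = [max(cosine(s, p) for p in phrase_vecs) for s in sub_image_vecs]
    return sum(best) / len(best)


def reward(image_vec: Vector, caption_vec: Vector,
           sub_image_vecs: Sequence[Vector], phrase_vecs: Sequence[Vector],
           alpha: float = 0.5) -> float:
    """Hypothetical reward: convex combination of global and local terms.
    alpha is an illustrative hyperparameter, not a value from the paper."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    g = global_similarity(image_vec, caption_vec)
    l = local_similarity(sub_image_vecs, phrase_vecs)
    return alpha * g + (1.0 - alpha) * l
```

In a reinforcement learning stage, a scalar of this kind could score each sampled caption. For a single-image meme, the list of sub-images would contain one element, so the local term reduces to the best phrase match for that image.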