FaceGemma: Enhancing Image Captioning with Facial Attributes for Portrait Images
This work addresses the need for more descriptive captions of portrait images, for accessibility and visual understanding. It is incremental, since it builds on existing models and datasets.
The researchers tackled the problem of generating accurate captions for portrait images by introducing FaceGemma, a model that incorporates facial attributes like emotions and features, which achieved a BLEU-1 score of 0.364 and a METEOR score of 0.355.
Automated image caption generation is essential for improving the accessibility and understanding of visual content. In this study, we introduce FaceGemma, a model that describes facial attributes such as emotions, expressions, and features. Using FaceAttDB data, we generated descriptions for 2000 faces with the Llama 3 70B model and fine-tuned the PaliGemma model on these descriptions. Based on the attributes and captions supplied in FaceAttDB, we created a new description dataset in which each description depicts the human-annotated attributes, including key features such as attractiveness, full lips, big nose, blond hair, brown hair, bushy eyebrows, eyeglasses, male, smile, and youth. This keeps the generated descriptions closely aligned with the visual details present in the images. FaceGemma combines annotated attributes, human-annotated captions, and prompt engineering to produce high-quality facial descriptions. Our method improved caption quality, achieving an average BLEU-1 score of 0.364 and a METEOR score of 0.355. These metrics indicate that incorporating facial attributes into image captioning yields more accurate and descriptive captions for portrait images.
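To make the reported average BLEU-1 figure concrete, the following is a minimal sketch of how a sentence-level BLEU-1 score (clipped unigram precision multiplied by a brevity penalty) can be computed and averaged over caption pairs. It is not the evaluation code used in this work: the function names `bleu1` and `mean_bleu1` are hypothetical, and lowercasing, whitespace tokenization, and a single reference per caption are assumptions, since the exact evaluation setup is not specified here. The example captions are invented for illustration only.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 against a single reference.

    Assumptions (not taken from the paper): lowercase text,
    whitespace tokenization, one reference per candidate.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    cand_counts = Counter(cand)
    ref_counts = Counter(ref)

    # Clipped unigram matches: a candidate token is credited at most
    # as many times as it occurs in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(ref) / len(cand))

    return bp * precision


def mean_bleu1(pairs) -> float:
    """Average sentence-level BLEU-1 over (candidate, reference) pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(bleu1(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative captions only; not drawn from FaceAttDB.
    examples = [
        ("a young man with brown hair and eyeglasses",
         "a smiling young man with brown hair and eyeglasses"),
        ("a woman with blond hair",
         "a young woman with blond hair and full lips"),
    ]
    print(round(mean_bleu1(examples), 3))
```

Averaging sentence-level scores is one reading of "average BLEU-1"; a corpus-level BLEU, which pools counts across all captions before dividing, would generally give a different number.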