CV | Oct 28, 2024

Face-MLLM: A Large Face Perception Model

arXiv:2410.20717v1 | 13 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses a gap in MLLMs for face perception, which is important for applications in human-computer interaction and security, but it is incremental as it builds on existing MLLM frameworks with specialized data and training.

The authors address the poor performance of multimodal large language models (MLLMs) on face perception tasks by constructing enriched datasets and developing a three-stage training method. The resulting model surpasses previous MLLMs on five tasks and shows superior zero-shot performance on a new facial attribute analysis task.

Although multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, their ability to perceive and understand human faces has rarely been explored. In this work, we comprehensively evaluate existing MLLMs on face perception tasks. The quantitative results reveal that existing MLLMs struggle with these tasks. The primary reason is the lack of image-text datasets that contain fine-grained descriptions of human faces. To tackle this problem, we design a practical pipeline for constructing datasets, upon which we further build a novel multimodal large face perception model, namely Face-MLLM. Specifically, we re-annotate the LAION-Face dataset with more detailed face captions and facial attribute labels. In addition, we reformulate traditional face datasets in a question-answer style, which suits MLLMs. Using these enriched datasets, we develop a novel three-stage MLLM training method. In the first two stages, our model learns visual-text alignment and basic visual question answering capability, respectively. In the third stage, our model learns to handle multiple specialized face perception tasks. Experimental results show that our model surpasses previous MLLMs on five well-known face perception tasks. In addition, on our newly introduced zero-shot facial attribute analysis task, Face-MLLM also achieves superior performance.
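
The question-answer reformulation of traditional face datasets mentioned in the abstract could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' pipeline: the `QAPair` record, the `TEMPLATES` prompts, the `to_qa_pairs` helper, and the example file name and labels are all hypothetical, since the abstract does not give the actual question wording or data schema.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One visual question-answer sample (hypothetical schema)."""
    image_path: str
    question: str
    answer: str


# Hypothetical question templates, one per labeled task.
# The real prompts used for Face-MLLM are not stated in the abstract.
TEMPLATES = {
    "age": "How old is the person in the image?",
    "expression": "What is the facial expression of the person in the image?",
}


def to_qa_pairs(image_path, labels):
    """Turn a classic (image, labels) record into question-answer pairs."""
    pairs = []
    for task, value in labels.items():
        question = TEMPLATES.get(task)
        if question is None:
            continue  # no template for this label, so it is skipped
        pairs.append(QAPair(image_path, question, str(value)))
    return pairs


# Example record with made-up labels; "pose" has no template and is dropped.
pairs = to_qa_pairs(
    "face_0001.jpg",
    {"age": 31, "expression": "happy", "pose": "frontal"},
)
for pair in pairs:
    print(pair.question, "->", pair.answer)
```

Keeping the templates in a plain dictionary makes it easy to see which labels are converted; any label without a template is silently skipped in this sketch.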

