CV · May 25

Enhancing Single-Image Facial Demorphing using Multimodal Large Language Models

arXiv:2605.2544264.2

Predicted impact: top 52% in CV · last 90 days
Originality: Highly original

AI Analysis

For forensic analysts and face recognition systems, this provides a novel capability to recover identities from morph attacks without reference images, addressing a critical gap in existing detection-only methods.

This paper introduces a reference-free facial demorphing framework in which Multimodal Large Language Models (MLLMs) guide a diffusion-based reconstruction, enabling recovery of the constituent identities from a morphed image without assuming identity overlap between training and test sets. RGB-domain demorphing outperforms latent-space approaches by 30-40% at strict operating points.

Face recognition systems are increasingly vulnerable to morphing attacks, where a composite image is crafted to match multiple identities, enabling unauthorized access and identity fraud. Existing detection methods identify morphed images but cannot recover constituent images or identities, limiting their forensic utility. This paper presents a novel reference-free facial demorphing framework that leverages Multimodal Large Language Models (MLLMs) to guide a coupled diffusion-based reconstruction process. Our key innovation lies in extracting semantic embeddings from intermediate MLLM layers to condition the demorphing, providing high-level reasoning about facial attributes and identity cues that complement low-level pixel information. We formulate demorphing as a coupled conditional generation problem, where both constituent faces are synthesized jointly through a denoising diffusion model operating directly in the RGB domain, ensuring inter-identity consistency while preserving fine-grained perceptual details. Unlike prior approaches that rely on compressed latent representations or assume identity overlap between training and testing sets, our method bypasses lossy text generation and re-encoding cycles by directly utilizing MLLM hidden states as conditioning signals, enabling the denoising network to attend to subtle visual cues such as hair, background, and facial textures. Ablation studies further reveal that middle MLLM layers encode more identity-discriminative representations, RGB-domain demorphing outperforms latent-space approaches by 30-40% at strict operating points, and full MLLM embeddings provide substantial advantages over raw ViT features through enhanced semantic structuring from multimodal pretraining.
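
One way to make the coupled conditional formulation concrete is the standard noise-prediction objective of denoising diffusion models. The sketch below is an assumption for illustration, not the paper's stated loss. All symbols are introduced here: $x_m$ is the morph, $x^{(1)}$ and $x^{(2)}$ are the constituent faces, $y_0$ is their channel-wise concatenation in RGB space, $h_\ell$ returns the hidden states of an intermediate MLLM layer $\ell$, $\bar\alpha_t$ is the cumulative noise schedule, and $\epsilon_\theta$ is the denoising network.

```latex
\begin{align}
y_0 &= \big[x^{(1)};\, x^{(2)}\big], \qquad c = h_\ell(x_m), \\
y_t &= \sqrt{\bar\alpha_t}\, y_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}(\theta) &= \mathbb{E}_{y_0,\, x_m,\, t,\, \epsilon}
  \Big[\big\lVert \epsilon - \epsilon_\theta(y_t,\, t,\, x_m,\, c) \big\rVert_2^2\Big].
\end{align}
```

Under this reading, "coupled" means a single network denoises $y_t$, so both faces are produced in the same pass and can stay mutually consistent. Conditioning on $c$ directly, rather than on text generated by the MLLM and then re-encoded, corresponds to the abstract's point about avoiding the lossy text round trip.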
