AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas
This work addresses multi-person identity-preserving image generation for applications such as photo editing and stylization, though it appears incremental because it builds on existing diffusion and transformer methods.
The paper tackles the problem of generating images of multiple people while preserving their identities and following text prompts, a setting in which strong identity and layout conditioning often leads to copy-paste shortcuts. It introduces AnyPhoto, a diffusion-transformer finetuning framework that improves identity similarity and reduces copy-paste tendency on MultiID-Bench, with gains increasing as the number of identities grows.
Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement, which suggests practical value for downstream applications.
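To make components (ii) and (iii) more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that AdaLN-style modulation means a parameter-free layer normalization followed by a per-identity scale and shift, and that identity isolation means tokens tagged with different identities may not attend to each other while shared tokens (text, image) stay visible to all. The function names, the group-id convention (0 for shared, k > 0 for identity k), and the idea that scale and shift are supplied directly rather than predicted from a face-recognition embedding are all hypothetical choices made for illustration.

```python
import math


def adaln_modulate(tokens, scale, shift, eps=1e-6):
    """Normalize each token vector, then apply x_hat * (1 + scale) + shift.

    Assumption: in a real model, scale and shift would be predicted from a
    face-recognition embedding by a small learned network; here they are
    passed in directly.
    """
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv_std * (1.0 + s) + b
                    for v, s, b in zip(tok, scale, shift)])
    return out


def identity_isolated_mask(group_ids):
    """Build a boolean attention mask allowed[query][key].

    Assumption: group id 0 marks shared tokens, k > 0 marks tokens of
    identity k. Attention is blocked only between two different identities.
    """
    return [[not (gq > 0 and gk > 0 and gq != gk) for gk in group_ids]
            for gq in group_ids]


def masked_softmax(scores, allowed):
    """Row-wise softmax that assigns zero weight to disallowed keys."""
    weights = []
    for row, ok in zip(scores, allowed):
        # Every token may attend to itself, so at least one key is kept.
        peak = max(s for s, a in zip(row, ok) if a)
        exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(row, ok)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical layout: one shared token, two tokens of identity 1,
# one token of identity 2.
groups = [0, 1, 1, 2]
mask = identity_isolated_mask(groups)
attn = masked_softmax([[0.0] * 4 for _ in range(4)], mask)
modulated = adaln_modulate([[1.0, 2.0, 3.0]], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

In this sketch the identity 1 tokens spread their attention over the shared token and each other but give zero weight to the identity 2 token, which is one simple way to read "prevent cross-identity interference". The modulation step shows how the same normalized features can be rescaled differently per identity.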