CV · May 22

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

arXiv:2605.24176

AI Analysis

For researchers in portrait animation, Loki reduces model complexity and data requirements while improving disentanglement of identity, expression, and pose.

Loki is a diffusion-based portrait animation method that encodes driver expression and pose with a parametric face model, factorizing them from identity and enabling cross-ID reenactment without cross-ID training data. It uses ~43% fewer inference parameters than leading diffusion baselines, is trained on 1496x fewer video samples, and leads or co-leads on two new metrics for pose and expression fidelity.

Portrait animation transfers a driver clip's facial expression and head pose onto a single reference image while preserving the reference's identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice -- learning facial expression and head pose from RGB, a representation in which identity, pose, and expression are entangled and can be separated only by learning. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a parametric face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone's own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross-ID reenactment reduces to a coefficient substitution at inference, requiring no cross-ID training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and is trained on 1496x fewer video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression follow the driver's -- the questions portrait animation actually asks; Loki leads or co-leads on both.
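The coefficient substitution described in the abstract can be illustrated with a minimal sketch. It assumes a generic linear parametric face model (mean shape plus identity and expression bases), which may differ from the face model Loki actually uses. Every name here (`FaceCoeffs`, `substitute_driver`, `linear_face_vertices`), the basis sizes, and the random bases are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FaceCoeffs:
    """Hypothetical container for the three factorised coefficient groups."""
    identity: np.ndarray    # who the face is
    expression: np.ndarray  # what the face is doing
    pose: np.ndarray        # head pose, e.g. rotation and translation


def substitute_driver(reference: FaceCoeffs, driver: FaceCoeffs) -> FaceCoeffs:
    """Keep the reference identity; take expression and pose from the driver."""
    return FaceCoeffs(
        identity=reference.identity.copy(),
        expression=driver.expression.copy(),
        pose=driver.pose.copy(),
    )


def linear_face_vertices(coeffs, mean, id_basis, expr_basis):
    """Assumed linear model: mean + identity offset + expression offset.

    Pose is a rigid transform applied afterwards and is omitted here.
    """
    flat = mean + id_basis @ coeffs.identity + expr_basis @ coeffs.expression
    return flat.reshape(-1, 3)


# Toy bases with made-up sizes; a real face model ships its own.
rng = np.random.default_rng(0)
n_vertices, n_id, n_expr = 5, 4, 3
mean = rng.normal(size=3 * n_vertices)
id_basis = rng.normal(size=(3 * n_vertices, n_id))
expr_basis = rng.normal(size=(3 * n_vertices, n_expr))

# Neutral reference face and a driver frame from a different identity.
reference = FaceCoeffs(rng.normal(size=n_id), np.zeros(n_expr), np.zeros(6))
driver = FaceCoeffs(
    rng.normal(size=n_id), rng.normal(size=n_expr), rng.normal(size=6)
)

# Cross-ID reenactment at the coefficient level: no retraining involved.
mixed = substitute_driver(reference, driver)
```

In this toy model the driver's identity coefficients never enter `mixed`, so no identity information can leak from the driver. A pipeline like the one described would then rasterise such coefficients into a conditioning map for the diffusion backbone.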
