LG, May 1

Towards Customized Multimodal Role-Play

arXiv:2605.08129  95.9
AI Analysis

For researchers in multimodal AI and interactive agents, this work addresses the underexplored problem of joint customization across modalities, providing a foundation for next-generation characterful agents.

The paper introduces Customized Multimodal Role-Play (CMRP), a new task for jointly customizing a character's persona, dialogue style, and visual identity. The proposed UniCharacter framework uses two-stage training with Unified-SFT and Character-GRPO, and produces coherent text and images for a target character from only 10 images plus corresponding interaction examples, in about 100 GPU hours. On the RoleScape-20 dataset, the authors report that it substantially outperforms prior approaches.

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To address this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape-20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next-generation characterful and immersive interactive agents.
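
The abstract names group relative policy optimization but does not describe how Character-GRPO scores its outputs. As a rough illustration of the generic GRPO idea only, the sketch below standardizes rewards within one group of responses sampled for the same prompt, so each response is judged relative to its siblings. The function name, the `eps` constant, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within a single group of sampled responses.

    Generic GRPO-style advantage: (reward - group mean) / group std.
    `eps` is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one role-play prompt
# (for example, scores for persona, style, and identity consistency).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Responses scoring above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. How the paper's character-specific rewards are actually defined is not stated in the abstract, so the scores above are placeholders.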
