CVNov 20, 2025Code
SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose ManipulationZhenyuan Qin, Xincheng Shuai, Henghui Ding
Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.
CVJan 2, 2025
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video GenerationXincheng Shuai, Henghui Ding, Zhenyuan Qin et al.
Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods can not simultaneously control the motions of both camera and objects in 3D-aware manner, resulting in limited controllability over generated contents. To address this issue and facilitate the research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models learning to disentangle the motion effects from objects and the camera in a video.~To provide precise 3D-aware motion control, we further propose a method trained on SynFMC, Free-Form Motion Control (FMC). FMC can control the 6D poses of objects and camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.
CVNov 17, 2025
Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D EngineXincheng Shuai, Zhenyuan Qin, Henghui Ding et al.
Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.