Instruction-based Image Manipulation by Watching How Things Move
This work addresses the challenge of scalable and natural instruction-based image manipulation for computer vision applications. Its contribution is incremental, since it builds on existing MLLM and video-data methods.
The paper tackles the problem of obtaining diverse and realistic training data for instruction-based image editing: it constructs a dataset from pairs of video frames and uses multimodal large language models to write the editing instructions. The resulting model achieves state-of-the-art performance in complex manipulations such as adjusting poses and altering camera perspectives.
This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics (such as non-rigid subject motion and complex camera movements) that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
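The pipeline described above (sample two frames from one video, then ask an MLLM to describe the change as an instruction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the frame-gap parameters, and the `describe_edit` callable that stands in for the MLLM are all hypothetical, chosen only to show the shape of the data flow.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class EditExample:
    """One training triplet: source frame, target frame, instruction."""
    source_idx: int
    target_idx: int
    instruction: str


def sample_frame_pairs(
    num_frames: int,
    num_pairs: int,
    min_gap: int,
    max_gap: int,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Sample (source, target) frame indices from a single video.

    The gap bounds are an assumption of this sketch: a minimum gap so
    the two frames differ visibly, a maximum gap so they still show
    the same subject and scene.
    """
    if min_gap < 1 or max_gap < min_gap:
        raise ValueError("require 1 <= min_gap <= max_gap")
    if num_frames <= min_gap:
        return []  # video too short to yield a valid pair
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        gap = rng.randint(min_gap, min(max_gap, num_frames - 1))
        start = rng.randint(0, num_frames - 1 - gap)
        pairs.append((start, start + gap))
    return pairs


def build_examples(
    frames: Sequence[object],
    pairs: Sequence[Tuple[int, int]],
    describe_edit: Callable[[object, object], str],
) -> List[EditExample]:
    """Turn frame pairs into triplets.

    `describe_edit` is a hypothetical stand-in for an MLLM call that
    receives the two frames and returns an editing instruction.
    """
    examples = []
    for src, tgt in pairs:
        instruction = describe_edit(frames[src], frames[tgt])
        examples.append(EditExample(src, tgt, instruction))
    return examples


if __name__ == "__main__":
    # Placeholder frames and a stub in place of a real MLLM.
    frames = [f"frame_{i}" for i in range(100)]
    pairs = sample_frame_pairs(len(frames), num_pairs=3, min_gap=10, max_gap=30)
    stub = lambda a, b: f"change {a} into {b}"
    for example in build_examples(frames, pairs, stub):
        print(example)
```

Because both frames come from the same clip, the source and target share subject and scene identity by construction, which is the property the abstract relies on. A real pipeline would replace the stub with an actual MLLM query and would likely filter out pairs whose differences are too small or too large.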