Teaching an Agent to Sketch One Part at a Time
This work addresses the need for more controllable and editable sketch generation in creative and design applications, though it appears incremental: it combines existing language model and reinforcement learning techniques with a new dataset.
The paper tackles text-to-vector sketch generation with a method that produces sketches one part at a time. It uses a multi-modal language model-based agent trained with multi-turn process-reward reinforcement learning, which yields interpretable, controllable, and locally editable sketches.
We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning procedure following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, which contains rich part-level annotations for sketches. The annotations are obtained with a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts through a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.
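To make the part-by-part idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It shows a sketch stored as named parts that each own a set of SVG paths, a multi-turn rollout in which the agent sees the current drawing before adding the next part, and one reward recorded per turn (a process reward) rather than a single reward at the end. The names `Part`, `Sketch`, `rollout`, `toy_policy`, and `toy_reward` are hypothetical, the policy and reward are toy stand-ins for a trained model and a learned or rule-based scorer, and the SVG string stands in for the rendered image used as visual feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Part:
    name: str         # semantic label, e.g. "wheels" (hypothetical schema)
    paths: List[str]  # SVG path "d" strings assigned to this part


@dataclass
class Sketch:
    prompt: str
    parts: List[Part] = field(default_factory=list)

    def to_svg(self, size: int = 256) -> str:
        """Serialize the sketch with one <g> group per semantic part."""
        groups = []
        for part in self.parts:
            paths = "".join(
                f'<path d="{d}" fill="none" stroke="black"/>' for d in part.paths
            )
            groups.append(f'<g id="{part.name}">{paths}</g>')
        return (
            f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'viewBox="0 0 {size} {size}">{"".join(groups)}</svg>'
        )

    def replace_part(self, name: str, new_paths: List[str]) -> None:
        """Local edit: swap the paths of one part, leaving the others untouched."""
        for part in self.parts:
            if part.name == name:
                part.paths = list(new_paths)
                return
        raise KeyError(name)


# Assumed interfaces: policy(prompt, part name, current SVG) -> new paths,
# reward_fn(sketch so far, part just added) -> score for this turn.
Policy = Callable[[str, str, str], List[str]]
RewardFn = Callable[[Sketch, Part], float]


def rollout(
    prompt: str, part_names: List[str], policy: Policy, reward_fn: RewardFn
) -> Tuple[Sketch, List[float]]:
    """Generate one part per turn and record one reward per turn."""
    sketch = Sketch(prompt)
    rewards: List[float] = []
    for name in part_names:
        feedback = sketch.to_svg()  # stand-in for a rendered image of the canvas
        part = Part(name, policy(prompt, name, feedback))
        sketch.parts.append(part)
        rewards.append(reward_fn(sketch, part))
    return sketch, rewards


def toy_policy(prompt: str, part_name: str, current_svg: str) -> List[str]:
    # Toy behavior: shift each new stroke by the number of parts already drawn.
    n = current_svg.count("<g ")
    return [f"M {10 + 20 * n} 10 L {30 + 20 * n} 30"]


def toy_reward(sketch: Sketch, part: Part) -> float:
    # Toy behavior: reward any turn that produced at least one path.
    return 1.0 if part.paths else 0.0


sketch, rewards = rollout(
    "a bicycle", ["frame", "wheels", "handlebar"], toy_policy, toy_reward
)
sketch.replace_part("wheels", ["M 0 0 L 5 5"])  # edit one part only
```

Keeping paths grouped by part is what makes the local edit cheap in this sketch: `replace_part` touches one group, so the remaining parts and their per-turn rewards stay valid. In a real training setup the per-turn rewards would feed a reinforcement learning update, which is omitted here.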