CV · Mar 29, 2023
MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
Qian Wang, Biao Zhang, Michael Birsak et al.
Image generation using diffusion can be controlled in multiple ways. In this paper, we systematically analyze the equations of modern generative diffusion networks to propose a framework, called MDP, that explains the design space of suitable manipulations. We identify five different manipulations: intermediate latent, conditional embedding, cross attention maps, guidance, and predicted noise. We analyze the corresponding parameters of these manipulations and the manipulation schedule. We show that some previous editing methods fit nicely into our framework. In particular, we identify one specific configuration as a new type of control, manipulating the predicted noise, which can perform higher-quality edits than previous work for a variety of local and global edits.
CV · Mar 26, 2023
BlobGAN-3D: A Spatially-Disentangled 3D-Aware Generative Model for Indoor Scenes
Qian Wang, Yiqun Wang, Michael Birsak et al.
3D-aware image synthesis has attracted increasing interest as it models the 3D nature of our real world. However, performing realistic object-level editing of the generated images in the multi-object scenario remains a challenge. Recently, a 2D GAN termed BlobGAN has demonstrated great multi-object editing capabilities on real-world indoor scene datasets. In this work, we propose BlobGAN-3D, a 3D-aware improvement of the original 2D BlobGAN. By extending the 2D blobs into 3D blobs, we enable explicit camera pose control while maintaining the disentanglement of individual objects in the scene. We keep the object-level editing capabilities of BlobGAN and additionally allow flexible control over the 3D location of the objects in the scene. We test our method on real-world indoor datasets and show that it achieves image quality comparable to the 2D BlobGAN and other 3D-aware GAN baselines while enabling camera pose control and object-level editing in challenging multi-object real-world scenarios.
LG · Sep 1, 2022
Large-Scale Auto-Regressive Modeling Of Street Networks
Michael Birsak, Tom Kelly, Wamiq Para et al.
We present a novel generative method for the creation of city-scale road layouts. While the output of recent methods is limited in both size of the covered area and diversity, our framework produces large traversable graphs of high quality consisting of vertices and edges representing complete street networks covering 400 square kilometers or more. While our framework can process general 2D embedded graphs, we focus on street networks due to the wide availability of training data. Our generative framework consists of a transformer decoder that is used in a sliding window manner to predict a field of indices, with each index encoding a representation of the local neighborhood. The semantics of each index is determined by a dictionary of context vectors. The index field is then input to a decoder to compute the street graph. Using data from OpenStreetMap, we train our system on whole cities and even across large countries such as the US, and finally compare it to the state of the art.
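The sliding-window prediction of an index field described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the raster scan order, the shape of the context window, the function name `generate_index_field`, and the toy predictor that stands in for the transformer decoder. The abstract does not specify these details.

```python
from typing import Callable, List


def generate_index_field(
    height: int,
    width: int,
    window: int,
    predict_index: Callable[[List[int]], int],
) -> List[List[int]]:
    """Fill a grid of indices in raster order, one cell at a time.

    Each cell is predicted from the already generated cells inside a local
    window (a hypothetical choice of context, not the paper's).
    """
    field = [[-1] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            context: List[int] = []
            for r in range(max(0, row - window), row + 1):
                for c in range(max(0, col - window), min(width, col + window + 1)):
                    if r == row and c >= col:
                        break  # cells from here on in the current row are not generated yet
                    context.append(field[r][c])
            field[row][col] = predict_index(context)
    return field


def toy_predictor(context: List[int]) -> int:
    """Stand-in for a learned model: a deterministic function of the context."""
    return (sum(context) + 1) % 4


field = generate_index_field(height=3, width=4, window=1, predict_index=toy_predictor)
```

In the setting of the abstract, each index would then be looked up in the dictionary of context vectors and the resulting field decoded into a street graph; both of those steps are omitted here.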
CV · Jan 29 · Code
Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning
Jian Shi, Michael Birsak, Wenqing Cui et al.
This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is available at https://github.com/shijianjian/vit-geometry-probes
AI · Jul 10, 2025 · Code
FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak et al.
We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, and bathrooms), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed in shallow queries, they often fail to respect physical constraints or preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
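To make the setup concrete, a distance-measurement query over a symbolic layout might look like the minimal sketch below. The JSON schema (`room`, `objects`, `name`, `center`) and the helper `center_distance` are hypothetical illustrations, not the benchmark's actual format or tooling.

```python
import json
import math

# Hypothetical layout: this schema is an assumption, not FloorplanQA's actual format.
LAYOUT_JSON = """
{
  "room": "kitchen",
  "objects": [
    {"name": "fridge", "center": [0.0, 0.0]},
    {"name": "sink", "center": [3.0, 4.0]}
  ]
}
"""


def center_distance(layout: dict, first: str, second: str) -> float:
    """Euclidean distance between the centers of two named objects."""
    centers = {obj["name"]: obj["center"] for obj in layout["objects"]}
    (ax, ay), (bx, by) = centers[first], centers[second]
    return math.hypot(bx - ax, by - ay)


layout = json.loads(LAYOUT_JSON)
distance = center_distance(layout, "fridge", "sink")
```

A ground-truth value computed this way could be compared against a model's free-text answer to the same question posed over the raw JSON.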
GR · Mar 24 · Code
Patchwork: A compact representation for 3D polygonal shapes
Ruichen Zheng, Biao Zhang, Michael Birsak et al.
We introduce Patchwork, a new general-purpose shape representation capable of modeling 2D and 3D geometry with a small number of parameters. Patchwork is grounded in a rigorous mathematical framework, providing provable complexity bounds and the ability to approximate arbitrary shapes with arbitrary precision in any dimension. We propose an efficient gradient-based optimization scheme to fit Patchwork representations to 2D and 3D data, along with a novel regularization loss that progressively prunes redundant elements, yielding high compactness after convergence. Our approach offers fast fitting performance, a fraction of the required parameters compared to existing alternatives, and native support for inside-outside classification, making it a versatile and compact representation for geometric learning and reconstruction tasks, with future potential for 3D generation. Our implementation is available at: https://github.com/Ankbzpx/patchwork-experiment.
CV · May 29, 2023 · Code
InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions
Qian Wang, Biao Zhang, Michael Birsak et al.
Recent works have explored text-guided image editing using diffusion models and generated edited images based on text prompts. However, the models struggle to accurately locate the regions to be edited and to perform precise edits faithfully. In this work, we propose a framework termed InstructEdit that performs fine-grained editing based on user instructions. Our proposed framework has three components: language processor, segmenter, and image editor. The first component, the language processor, processes the user instruction using a large language model. The goal of this processing is to parse the user instruction and output prompts for the segmenter and captions for the image editor. We adopt ChatGPT and optionally BLIP2 for this step. The second component, the segmenter, uses the segmentation prompt provided by the language processor. We employ the state-of-the-art segmentation framework Grounded Segment Anything to automatically generate a high-quality mask based on the segmentation prompt. The third component, the image editor, uses the captions from the language processor and the masks from the segmenter to compute the edited image. We adopt Stable Diffusion and the mask-guided generation from DiffEdit for this purpose. Experiments show that our method outperforms previous editing methods in fine-grained editing applications where the input image contains a complex object or multiple objects. We improve the mask quality over DiffEdit and thus improve the quality of edited images. We also show that our framework can accept multiple forms of user instructions as input. We provide the code at https://github.com/QianWangX/InstructEdit.
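The data flow between the three components can be sketched as follows. This is only an illustration of the wiring under stated assumptions: the `ParsedInstruction` fields, the function names, and the toy stand-ins for the language model, the segmenter, and the diffusion-based editor are all hypothetical and do not reflect the released code.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image
Mask = List[List[int]]   # 1 marks pixels to edit


@dataclass
class ParsedInstruction:
    # Field names are assumptions; the abstract only says the language
    # processor outputs a segmentation prompt and captions.
    segmentation_prompt: str
    input_caption: str
    edited_caption: str


def instruct_edit_pipeline(
    image: Image,
    instruction: str,
    language_processor: Callable[[str], ParsedInstruction],
    segmenter: Callable[[Image, str], Mask],
    image_editor: Callable[[Image, Mask, str, str], Image],
) -> Image:
    """Chain the three components: parse, segment, then edit inside the mask."""
    parsed = language_processor(instruction)
    mask = segmenter(image, parsed.segmentation_prompt)
    return image_editor(image, mask, parsed.input_caption, parsed.edited_caption)


# Toy stand-ins so the sketch runs without any model.
def toy_language_processor(instruction: str) -> ParsedInstruction:
    return ParsedInstruction("dog", "a photo of a dog", "a photo of a cat")


def toy_segmenter(image: Image, prompt: str) -> Mask:
    return [[1 if px > 0 else 0 for px in row] for row in image]


def toy_image_editor(image: Image, mask: Mask, src: str, dst: str) -> Image:
    return [[9 if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


image = [[0, 5], [5, 0]]
edited = instruct_edit_pipeline(
    image, "change the dog to a cat",
    toy_language_processor, toy_segmenter, toy_image_editor,
)
```

Passing the components as callables keeps the sketch independent of any particular model; pixels outside the mask are left untouched by the toy editor.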
CV · Jan 27, 2025
MatCLIP: Light- and Shape-Insensitive Assignment of PBR Material Models
Michael Birsak, John Femiani, Biao Zhang et al.
Assigning realistic materials to 3D models remains a significant challenge in computer graphics. We propose MatCLIP, a novel method that extracts shape- and lighting-insensitive descriptors of Physically Based Rendering (PBR) materials to assign plausible textures to 3D objects based on images, such as the output of Latent Diffusion Models (LDMs) or photographs. Matching PBR materials to static images is challenging because the PBR representation captures the dynamic appearance of materials under varying viewing angles, shapes, and lighting conditions. By extending an Alpha-CLIP-based model on material renderings across diverse shapes and lighting, and encoding multiple viewing conditions for PBR materials, our approach generates descriptors that bridge the domains of PBR representations with photographs or renderings, including LDM outputs. This enables consistent material assignments without requiring explicit knowledge of material relationships between different parts of an object. MatCLIP achieves a top-1 classification accuracy of 76.6%, outperforming state-of-the-art methods such as PhotoShape and MatAtlas by over 15 percentage points on publicly available datasets. Our method can be used to construct material assignments for 3D shape datasets such as ShapeNet, 3DCoMPaT++, and Objaverse. All code and data will be released.
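Descriptor-based assignment of this kind is commonly implemented as nearest-neighbor retrieval. The sketch below assumes cosine similarity between an image descriptor and one descriptor per material, which is typical for CLIP-style embeddings but is not stated in the abstract. The material names, the three-dimensional vectors, and the function `top1_material` are made up for illustration.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_material(query: Sequence[float],
                  library: Dict[str, Sequence[float]]) -> str:
    """Return the material whose descriptor is most similar to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))


# Hypothetical descriptors; real ones would come from the trained encoder.
material_library = {
    "oak_wood": [0.9, 0.1, 0.0],
    "brushed_steel": [0.1, 0.9, 0.2],
    "red_fabric": [0.0, 0.2, 0.9],
}
query_descriptor = [0.2, 0.8, 0.1]  # descriptor of an image region (made up)
best = top1_material(query_descriptor, material_library)
```

Top-1 accuracy, as reported in the abstract, would then be the fraction of queries for which the retrieved material matches the ground-truth label.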