CVMar 24, 2022
CLIP-Mesh: Generating textured meshes from text using pretrained image-text modelsNasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky et al. · mila
We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision our method deforms the control shape of a limit subdivided surface along with its texture map and normal map to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed into games or modeling applications. We rely only on a pre-trained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models we perform optimization on mesh parameters directly to generate shape, texture or both. To constrain the optimization to produce plausible meshes and textures we introduce a number of techniques using image augmentations and the use of a pretrained prior that generates CLIP image embeddings given a text embedding.
GROct 6, 2023
DragD3D: Realistic Mesh Editing with Rigidity Control Driven by 2D Diffusion PriorsTianhao Xie, Eugene Belilovsky, Sudhir Mudur et al.
Direct mesh editing and deformation are key components in the geometric modeling and animation pipeline. Mesh editing methods are typically framed as optimization problems combining user-specified vertex constraints with a regularizer that determines the position of the rest of the vertices. The choice of the regularizer is key to the realism and authenticity of the final result. Physics and geometry-based regularizers are not aware of the global context and semantics of the object, and the more recent deep learning priors are limited to a specific class of 3D object deformations. Our main contribution is a vertex-based mesh editing method called DragD3D based on (1) a novel optimization formulation that decouples the rotation and stretch components of the deformation and combines a 3D geometric regularizer with (2) the recently introduced DDS loss which scores the faithfulness of the rendered 2D image to one from a diffusion model. Thus, our deformation method achieves globally realistic shape deformation which is not restricted to any class of objects. Our new formulation optimizes directly the transformation of the neural Jacobian field explicitly separating the rotational and stretching components. The objective function of the optimization combines the approximate gradients of DDS and the gradients from the geometric loss to satisfy the vertex constraints. Additional user control over desired global shape deformation is made possible by allowing explicit per-triangle deformation control as well as explicit separation of rotational and stretching components of the deformation. We show that our deformations can be controlled to yield realistic shape deformations that are aware of the global context of the objects, and provide better results than just using geometric regularizers.
CVNov 19, 2024
Sketch-guided Cage-based 3D Gaussian Splatting DeformationTianhao Xie, Noam Aigerman, Eugene Belilovsky et al.
3D Gaussian Splatting (GS) is one of the most promising novel 3D representations that has received great interest in computer graphics and computer vision. While various systems have introduced editing capabilities for 3D GS, such as those guided by text prompts, fine-grained control over deformation remains an open challenge. In this work, we present a novel sketch-guided 3D GS deformation system that allows users to intuitively modify the geometry of a 3D GS model by drawing a silhouette sketch from a single viewpoint. Our approach introduces a new deformation method that combines cage-based deformations with a variant of Neural Jacobian Fields, enabling precise, fine-grained control. Additionally, it leverages large-scale 2D diffusion priors and ControlNet to ensure the generated deformations are semantically plausible. Through a series of experiments, we demonstrate the effectiveness of our method and showcase its ability to animate static 3D GS models as one of its key applications.
CVNov 19, 2024
Towards motion from video diffusion modelsPaul Janson, Tiberiu Popa, Eugene Belilovsky
Text-conditioned video diffusion models have emerged as a powerful tool in the realm of video generation and editing. But their ability to capture the nuances of human movement remains under-explored. Indeed the ability of these models to faithfully model an array of text prompts can lead to a wide host of applications in human and character animation. In this work, we take initial steps to investigate whether these models can effectively guide the synthesis of realistic human body animations. Specifically we propose to synthesize human motion by deforming an SMPL-X body representation guided by Score distillation sampling (SDS) calculated using a video diffusion model. By analyzing the fidelity of the resulting animations, we gain insights into the extent to which we can obtain motion using publicly available text-to-video diffusion models using SDS. Our findings shed light on the potential and limitations of these models for generating diverse and plausible human motions, paving the way for further research in this exciting area.
CVNov 28, 2025
FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian SplattingTianhao Xie, Linlian Jiang, Xinxin Zuo et al.
Realistic scene appearance modeling has advanced rapidly with Gaussian Splatting, which enables real-time, high-quality rendering. Recent advances introduced per-primitive textures that incorporate spatial color variations within each Gaussian, improving their expressiveness. However, texture-based Gaussians parameterize appearance with a uniform per-Gaussian sampling grid, allocating equal sampling density regardless of local visual complexity. This leads to inefficient texture space utilization, where high-frequency regions are under-sampled and smooth regions waste capacity, causing blurred appearance and loss of fine structural detail. We introduce FACT-GS, a Frequency-Aligned Complexity-aware Texture Gaussian Splatting framework that allocates texture sampling density according to local visual frequency. Grounded in adaptive sampling theory, FACT-GS reformulates texture parameterization as a differentiable sampling-density allocation problem, replacing the uniform textures with a learnable frequency-aware allocation strategy implemented via a deformation field whose Jacobian modulates local sampling density. Built on 2D Gaussian Splatting, FACT-GS performs non-uniform sampling on fixed-resolution texture grids, preserving real-time performance while recovering sharper high-frequency details under the same parameter budget.
CVJun 23, 2025
End-to-End Fine-Tuning of 3D Texture Generation using Differentiable RewardsAmirHossein Zamani, Tianhao Xie, Amir G. Aghdam et al.
While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. We will make our implementation code publicly available upon acceptance of the paper.
CVJun 1, 2024
Temporally Consistent Object Editing in Videos using Extended AttentionAmirHossein Zamani, Amir G. Aghdam, Tiberiu Popa et al.
Image generation and editing have seen a great deal of advancements with the rise of large-scale diffusion models that allow user control of different modalities such as text, mask, depth maps, etc. However, controlled editing of videos still lags behind. Prior work in this area has focused on using 2D diffusion models to globally change the style of an existing video. On the other hand, in many practical applications, editing localized parts of the video is critical. In this work, we propose a method to edit videos using a pre-trained inpainting image diffusion model. We systematically redesign the forward path of the model by replacing the self-attention modules with an extended version of attention modules that creates frame-level dependencies. In this way, we ensure that the edited information will be consistent across all the video frames no matter what the shape and position of the masked area is. We qualitatively compare our results with state-of-the-art in terms of accuracy on several video editing tasks like object retargeting, object replacement, and object removal tasks. Simulations demonstrate the superior performance of the proposed strategy.
CVOct 6, 2017
CAMREP- Concordia Action and Motion RepositoryKaustubha Mendhurwar, Qing Gu, Vladimir de la Cruz et al.
Action recognition, motion classification, gait analysis and synthesis are fundamental problems in a number of fields such as computer graphics, bio-mechanics and human computer interaction that generate a large body of research. This type of data is complex because it is inherently multidimensional and has multiple modalities such as video, motion capture data, accelerometer data, etc. While some of this data, such as monocular video are easy to acquire, others are much more difficult and expensive such as motion capture data or multi-view video. This creates a large barrier of entry in the research community for data driven research. We have embarked on creating a new large repository of motion and action data (CAMREP) consisting of several motion and action databases. What makes this database unique is that we use a variety of modalities, enabling multi-modal analysis. Presently, the size of datasets varies with some having a large number of subjects while others having smaller numbers. We have also acquired long capture sequences in a number of cases, making some datasets rather large.