CVAug 17, 2023Code
Spatially and Spectrally Consistent Deep Functional MapsMingze Sun, Shiwei Mao, Puhua Jiang et al.
Cycle consistency has long been exploited as a powerful prior for jointly optimizing maps within a collection of shapes. In this paper, we investigate its utility in the approaches of Deep Functional Maps, which are considered state-of-the-art in non-rigid shape matching. We first justify that under certain conditions, the learned maps, when represented in the spectral domain, are already cycle consistent. Furthermore, we identify the discrepancy that spectrally consistent maps are not necessarily spatially, or point-wise, consistent. In light of this, we present a novel design of unsupervised Deep Functional Maps, which effectively enforces the harmony of learned maps under the spectral and the point-wise representation. By taking advantage of cycle consistency, our framework produces state-of-the-art results in mapping shapes even under significant distortions. Beyond that, by independently estimating maps in both spectral and spatial domains, our method naturally alleviates over-fitting in network training, yielding superior generalization performance and accuracy within an array of challenging tests for both near-isometric and non-isometric datasets. Codes are available at https://github.com/rqhuang88/Spatiallyand-Spectrally-Consistent-Deep-Functional-Maps.
CVMar 29, 2023Code
Understanding and Improving Features Learned in Deep Functional MapsSouhaib Attaiki, Maks Ovsjanikov
Deep functional maps have recently emerged as a successful paradigm for non-rigid 3D shape correspondence tasks. An essential step in this pipeline consists in learning feature functions that are used as constraints to solve for a functional map inside the network. However, the precise nature of the information learned and stored in these functions is not yet well understood. Specifically, a major question is whether these features can be used for any other objective, apart from their purely algebraic role in solving for functional map matrices. In this paper, we show that under some mild conditions, the features learned within deep functional map approaches can be used as point-wise descriptors and thus are directly comparable across different shapes, even without the necessity of solving for a functional map at test time. Furthermore, informed by our analysis, we propose effective modifications to the standard deep functional map pipeline, which promote structural properties of learned features, significantly improving the matching results. Finally, we demonstrate that previously unsuccessful attempts at using extrinsic architectures for deep functional map feature extraction can be remedied via simple architectural changes, which encourage the theoretical properties suggested by our analysis. We thus bridge the gap between intrinsic and extrinsic surface-based learning, suggesting the necessary and sufficient conditions for successful shape matching. Our code is available at https://github.com/pvnieo/clover.
CVNov 26, 2022Code
Reduced Representation of Deformation Fields for Effective Non-rigid Shape MatchingRamana Sundararaman, Riccardo Marin, Emanuele Rodola et al.
In this work we present a novel approach for computing correspondences between non-rigid objects, by exploiting a reduced representation of deformation fields. Different from existing works that represent deformation fields by training a general-purpose neural network, we advocate for an approximation based on mesh-free methods. By letting the network learn deformation parameters at a sparse set of positions in space (nodes), we reconstruct the continuous deformation field in a closed-form with guaranteed smoothness. With this reduction in degrees of freedom, we show significant improvement in terms of data-efficiency thus enabling limited supervision. Furthermore, our approximation provides direct access to first-order derivatives of deformation fields, which facilitates enforcing desirable regularization effectively. Our resulting model has high expressive power and is able to capture complex deformations. We illustrate its effectiveness through state-of-the-art results across multiple deformable shape matching benchmarks. Our code and data are publicly available at: https://github.com/Sentient07/DeformationBasis.
CVNov 29, 2023
Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D FeaturesThomas Wimmer, Peter Wonka, Maks Ovsjanikov · eth-zurich
With the immense growth of dataset sizes and computing resources in recent years, so-called foundation models have become popular in NLP and vision tasks. In this work, we propose to explore foundation models for the task of keypoint detection on 3D shapes. A unique characteristic of keypoint detection is that it requires semantic and geometric awareness while demanding high localization accuracy. To address this problem, we propose, first, to back-project features from large pre-trained 2D vision models onto 3D shapes and employ them for this task. We show that we obtain robust 3D features that contain rich semantic information and analyze multiple candidate features stemming from different 2D foundation models. Second, we employ a keypoint candidate optimization module which aims to match the average observed distribution of keypoints on the shape and is guided by the back-projected features. The resulting approach achieves a new state of the art for few-shot keypoint detection on the KeyPointNet dataset, almost doubling the performance of the previous best methods.
LGSep 28, 2023Code
AtomSurf : Surface Representation for Learning on Protein StructuresVincent Mallet, Souhaib Attaiki, Yangyang Miao et al.
While there has been significant progress in evaluating and comparing different representations for learning on protein data, the role of surface-based learning approaches remains not well-understood. In particular, there is a lack of direct and fair benchmark comparison between the best available surface-based learning methods against alternative representations such as graphs. Moreover, the few existing surface-based approaches either use surface information in isolation or, at best, perform global pooling between surface and graph-based architectures. In this work, we fill this gap by first adapting a state-of-the-art surface encoder for protein learning tasks. We then perform a direct and fair comparison of the resulting method against alternative approaches within the Atom3D benchmark, highlighting the limitations of pure surface-based learning. Finally, we propose an integrated approach, which allows learned feature sharing between graphs and surface representations on the level of nodes and vertices across all layers. We demonstrate that the resulting architecture achieves state-of-the-art results on all tasks in the Atom3D benchmark, while adhering to the strict benchmark protocol, as well as more broadly on binding site identification and binding pocket classification. Furthermore, we use coarsened surfaces and optimize our approach for efficiency, making our tool competitive in training and inference time with existing techniques. Code can be found online: https://github.com/Vincentx15/atomsurf
CVOct 4, 2022
Affection: Learning Affective Explanations for Real-World Visual DataPanos Achlioptas, Maks Ovsjanikov, Leonidas Guibas et al.
In this work, we explore the emotional reactions that real-world images tend to induce by using natural language as the medium to express the rationale behind an affective response to a given visual stimulus. To embark on this journey, we introduce and share with the research community a large-scale dataset that contains emotional reactions and free-form textual explanations for 85,007 publicly available images, analyzed by 6,283 annotators who were asked to indicate and explain how and why they felt in a particular way when observing a specific image, producing a total of 526,749 responses. Even though emotional reactions are subjective and sensitive to context (personal mood, social status, past experiences) - we show that there is significant common ground to capture potentially plausible emotional responses with a large support in the subject population. In light of this crucial observation, we ask the following questions: i) Can we develop multi-modal neural networks that provide reasonable affective responses to real-world visual data, explained with language? ii) Can we steer such methods towards producing explanations with varying degrees of pragmatic language or justifying different emotional reactions while adapting to the underlying visual stimulus? Finally, iii) How can we evaluate the performance of such methods for this novel task? With this work, we take the first steps in addressing all of these questions, thus paving the way for richer, more human-centric, and emotionally-aware image analysis systems. Our introduced dataset and all developed methods are available on https://affective-explanations.org
CVApr 11, 2023
SATR: Zero-Shot Semantic Segmentation of 3D ShapesAhmed Abdelreheem, Ivan Skorokhodov, Maks Ovsjanikov et al.
We explore the task of zero-shot semantic segmentation of 3D shapes by using large-scale off-the-shelf 2D image recognition models. Surprisingly, we find that modern zero-shot 2D object detectors are better suited for this task than contemporary text/image similarity predictors or even zero-shot 2D segmentation networks. Our key finding is that it is possible to extract accurate 3D segmentation maps from multi-view bounding box predictions by using the topological properties of the underlying surface. For this, we develop the Segmentation Assignment with Topological Reweighting (SATR) algorithm and evaluate it on ShapeNetPart and our proposed FAUST benchmarks. SATR achieves state-of-the-art performance and outperforms a baseline algorithm by 1.3% and 4% average mIoU on the FAUST coarse and fine-grained benchmarks, respectively, and by 5.2% average mIoU on the ShapeNetPart benchmark. Our source code and data will be publicly released. Project webpage: https://samir55.github.io/SATR/.
CVOct 5, 2022
Smooth Non-Rigid Shape Matching via Effective Dirichlet Energy OptimizationRobin Magnet, Jing Ren, Olga Sorkine-Hornung et al.
We introduce pointwise map smoothness via the Dirichlet energy into the functional map pipeline, and propose an algorithm for optimizing it efficiently, which leads to high-quality results in challenging settings. Specifically, we first formulate the Dirichlet energy of the pulled-back shape coordinates, as a way to evaluate smoothness of a pointwise map across discrete surfaces. We then extend the recently proposed discrete solver and show how a strategy based on auxiliary variable reformulation allows us to optimize pointwise map smoothness alongside desirable functional map properties such as bijectivity. This leads to an efficient map refinement strategy that simultaneously improves functional and point-to-point correspondences, obtaining smooth maps even on non-isometric shape pairs. Moreover, we demonstrate that several previously proposed methods for computing smooth maps can be reformulated as variants of our approach, which allows us to compare different formulations in a consistent framework. Finally, we compare these methods both on existing benchmarks and on a new rich dataset that we introduce, which contains non-rigid, non-isometric shape pairs with inter-category and cross-category correspondences. Our work leads to a general framework for optimizing and analyzing map smoothness both conceptually and in challenging practical settings.
CVJun 5, 2023
Zero-Shot 3D Shape CorrespondenceAhmed Abdelreheem, Abdelrahman Eldesokey, Maks Ovsjanikov et al.
We propose a novel zero-shot approach to computing correspondences between 3D shapes. Existing approaches mainly focus on isometric and near-isometric shape pairs (e.g., human vs. human), but less attention has been given to strongly non-isometric and inter-class shape matching (e.g., human vs. cow). To this end, we introduce a fully automatic method that exploits the exceptional reasoning capabilities of recent foundation models in language and vision to tackle difficult shape correspondence problems. Our approach comprises multiple stages. First, we classify the 3D shapes in a zero-shot manner by feeding rendered shape views to a language-vision model (e.g., BLIP2) to generate a list of class proposals per shape. These proposals are unified into a single class per shape by employing the reasoning capabilities of ChatGPT. Second, we attempt to segment the two shapes in a zero-shot manner, but in contrast to the co-segmentation problem, we do not require a mutual set of semantic regions. Instead, we propose to exploit the in-context learning capabilities of ChatGPT to generate two different sets of semantic regions for each shape and a semantic mapping between them. This enables our approach to match strongly non-isometric shapes with significant differences in geometric structure. Finally, we employ the generated semantic mapping to produce coarse correspondences that can further be refined by the functional maps framework to produce dense point-to-point maps. Our approach, despite its simplicity, produces highly plausible results in a zero-shot manner, especially between strongly non-isometric shapes. Project webpage: https://samir55.github.io/3dshapematch/.
CVApr 28, 2022
Deep Orientation-Aware Functional Maps: Tackling Symmetry Issues in Shape MatchingNicolas Donati, Etienne Corman, Maks Ovsjanikov
State-of-the-art fully intrinsic networks for non-rigid shape matching often struggle to disambiguate the symmetries of the shapes leading to unstable correspondence predictions. Meanwhile, recent advances in the functional map framework allow to enforce orientation preservation using a functional representation for tangent vector field transfer, through so-called complex functional maps. Using this representation, we propose a new deep learning approach to learn orientation-aware features in a fully unsupervised setting. Our architecture is built on top of DiffusionNet, making it robust to discretization changes. Additionally, we introduce a vector field-based loss, which promotes orientation preservation without using (often unstable) extrinsic descriptors.
CVOct 12, 2022
Learning Multi-resolution Functional Maps with Spectral Attention for Robust Shape MatchingLei Li, Nicolas Donati, Maks Ovsjanikov
In this work, we present a novel non-rigid shape matching framework based on multi-resolution functional maps with spectral attention. Existing functional map learning methods all rely on the critical choice of the spectral resolution hyperparameter, which can severely affect the overall accuracy or lead to overfitting, if not chosen carefully. In this paper, we show that spectral resolution tuning can be alleviated by introducing spectral attention. Our framework is applicable in both supervised and unsupervised settings, and we show that it is possible to train the network so that it can adapt the spectral resolution, depending on the given shape input. More specifically, we propose to compute multi-resolution functional maps that characterize correspondence across a range of spectral resolutions, and introduce a spectral attention network that helps to combine this representation into a single coherent final correspondence. Our approach is not only accurate with near-isometric input, for which a high spectral resolution is typically preferred, but also robust and able to produce reasonable matching even in the presence of significant non-isometric distortion, which poses great challenges to existing methods. We demonstrate the superior performance of our approach through experiments on a suite of challenging near-isometric and non-isometric shape matching benchmarks.
CVMay 10, 2022
Non-Isometric Shape Matching via Functional Maps on Landmark-Adapted BasesMikhail Panine, Maxime Kirgo, Maks Ovsjanikov
We propose a principled approach for non-isometric landmark-preserving non-rigid shape matching. Our method is based on the functional maps framework, but rather than promoting isometries we focus instead on near-conformal maps that preserve landmarks exactly. We achieve this, first, by introducing a novel landmark-adapted basis using an intrinsic Dirichlet-Steklov eigenproblem. Second, we establish the functional decomposition of conformal maps expressed in this basis. Finally, we formulate a conformally-invariant energy that promotes high-quality landmark-preserving maps, and show how it can be solved via a variant of the recently proposed ZoomOut method that we extend to our setting. Our method is descriptor-free, efficient and robust to significant mesh variability. We evaluate our approach on a range of benchmark datasets and demonstrate state-of-the-art performance on non-isometric benchmarks and near state-of-the-art performance on isometric ones.
CVMar 15, 2022
Implicit field supervision for robust non-rigid shape matchingRamana Sundararaman, Gautam Pai, Maks Ovsjanikov
Establishing a correspondence between two non-rigidly deforming shapes is one of the most fundamental problems in visual computing. Existing methods often show weak resilience when presented with challenges innate to real-world data such as noise, outliers, self-occlusion etc. On the other hand, auto-decoders have demonstrated strong expressive power in learning geometrically meaningful latent embeddings. However, their use in \emph{shape analysis} has been limited. In this paper, we introduce an approach based on an auto-decoder framework, that learns a continuous shape-wise deformation field over a fixed template. By supervising the deformation field for points on-surface and regularising for points off-surface through a novel \emph{Signed Distance Regularisation} (SDR), we learn an alignment between the template and shape \emph{volumes}. Trained on clean water-tight meshes, \emph{without} any data-augmentation, we demonstrate compelling performance on compromised data and real-world scans.
LGDec 5, 2022
TIDE: Time Derivative Diffusion for Deep Learning on GraphsMaysam Behmanesh, Maximilian Krahn, Maks Ovsjanikov
A prominent paradigm for graph neural networks is based on the message-passing framework. In this framework, information communication is realized only between neighboring nodes. The challenge of approaches that use this paradigm is to ensure efficient and accurate long-distance communication between nodes, as deep convolutional networks are prone to oversmoothing. In this paper, we present a novel method based on time derivative graph diffusion (TIDE) to overcome these structural limitations of the message-passing framework. Our approach allows for optimizing the spatial extent of diffusion across various tasks and network channels, thus enabling medium and long-distance communication efficiently. Furthermore, we show that our architecture design also enables local message-passing and thus inherits from the capabilities of local message-passing approaches. We show that on both widely used graph benchmarks and synthetic mesh and graph datasets, the proposed framework outperforms state-of-the-art methods by a significant margin
CGNov 29, 2022
Equivalence Between SE(3) Equivariant Networks via Steerable Kernels and Group ConvolutionAdrien Poulenard, Maks Ovsjanikov, Leonidas J. Guibas
A wide range of techniques have been proposed in recent years for designing neural networks for 3D data that are equivariant under rotation and translation of the input. Most approaches for equivariance under the Euclidean group $\mathrm{SE}(3)$ of rotations and translations fall within one of the two major categories. The first category consists of methods that use $\mathrm{SE}(3)$-convolution which generalizes classical $\mathbb{R}^3$-convolution on signals over $\mathrm{SE}(3)$. Alternatively, it is possible to use \textit{steerable convolution} which achieves $\mathrm{SE}(3)$-equivariance by imposing constraints on $\mathbb{R}^3$-convolution of tensor fields. It is known by specialists in the field that the two approaches are equivalent, with steerable convolution being the Fourier transform of $\mathrm{SE}(3)$ convolution. Unfortunately, these results are not widely known and moreover the exact relations between deep learning architectures built upon these two approaches have not been precisely described in the literature on equivariant deep learning. In this work we provide an in-depth analysis of both methods and their equivalence and relate the two constructions to multiview convolutional networks. Furthermore, we provide theoretical justifications of separability of $\mathrm{SE}(3)$ group convolution, which explain the applicability and success of some recent approaches. Finally, we express different methods using a single coherent formalism and provide explicit formulas that relate the kernels learned by different methods. In this way, our work helps to unify different previously-proposed techniques for achieving roto-translational equivariance, and helps to shed light on both the utility and precise differences between various alternatives. We also derive new TFN non-linearities from our equivalence principle and test them on practical benchmark datasets.
CVSep 26, 2024
DeBaRA: Denoising-Based 3D Room Arrangement GenerationLéopold Maillard, Nicolas Sereyjol-Garros, Tom Durand et al.
Generating realistic and diverse layouts of furnished indoor 3D scenes unlocks multiple interactive applications impacting a wide range of industries. The inherent complexity of object interactions, the limited amount of available data and the requirement to fulfill spatial constraints all make generative modeling for 3D scene synthesis and arrangement challenging. Current methods address these challenges autoregressively or by using off-the-shelf diffusion objectives by simultaneously predicting all attributes without 3D reasoning considerations. In this paper, we introduce DeBaRA, a score-based model specifically tailored for precise, controllable and flexible arrangement generation in a bounded environment. We argue that the most critical component of a scene synthesis system is to accurately establish the size and position of various objects within a restricted area. Based on this insight, we propose a lightweight conditional score-based model designed with 3D spatial awareness at its core. We demonstrate that by focusing on spatial attributes of objects, a single trained DeBaRA model can be leveraged at test time to perform several downstream applications such as scene synthesis, completion and re-arrangement. Further, we introduce a novel Self Score Evaluation procedure so it can be optimally employed alongside external LLM models. We evaluate our approach through extensive experiments and demonstrate significant improvement upon state-of-the-art approaches in a range of scenarios.
81.8LGMar 10
Generative Drifting is Secretly Score Matching: a Spectral and Variational PerspectiveErkan Turan, Maks Ovsjanikov
Generative Modeling via Drifting has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: \emph{under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions}. This insight allows us to answer all three key questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to \emph{Landau damping} in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. We also propose an exponential bandwidth annealing schedule $σ(t)=σ_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is derived directly from the frozen-field discretization mandated by the JKO scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, demonstrated with a Sinkhorn divergence drift.
CVSep 16, 2022
SRFeat: Learning Locally Accurate and Globally Consistent Non-Rigid Shape CorrespondenceLei Li, Souhaib Attaiki, Maks Ovsjanikov
In this work, we present a novel learning-based framework that combines the local accuracy of contrastive learning with the global consistency of geometric approaches, for robust non-rigid matching. We first observe that while contrastive learning can lead to powerful point-wise features, the learned correspondences commonly lack smoothness and consistency, owing to the purely combinatorial nature of the standard contrastive losses. To overcome this limitation we propose to boost contrastive feature learning with two types of smoothness regularization that inject geometric information into correspondence learning. With this novel combination in hand, the resulting features are both highly discriminative across individual points, and, at the same time, lead to robust and consistent correspondences, through simple proximity queries. Our framework is general and is applicable to local feature learning in both the 3D and 2D domains. We demonstrate the superiority of our approach through extensive experiments on a wide range of challenging matching benchmarks, including 3D non-rigid shape correspondence and 2D image keypoint matching.
42.5CVApr 24
FILTR: Extracting Topological Features from Pretrained 3D ModelsLouis Martinez, Maks Ovsjanikov
Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.
56.7CVApr 16
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution ShapesVictoria Yue Chen, Emery Pierson, Léopold Maillard et al.
Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent ``sink traps'': regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's \textit{geometric} expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution \textit{text} guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes. Project webpage: https://daidedou.sorpi.fr/publication/beyondprompts
CVOct 27, 2023
Unsupervised Representation Learning for Diverse Deformable Shape CollectionsSara Hahner, Souhaib Attaiki, Jochen Garcke et al.
We introduce a novel learning-based method for encoding and manipulating 3D surface meshes. Our method is specifically designed to create an interpretable embedding space for deformable shape collections. Unlike previous 3D mesh autoencoders that require meshes to be in a 1-to-1 correspondence, our approach is trained on diverse meshes in an unsupervised manner. Central to our method is a spectral pooling technique that establishes a universal latent space, breaking free from traditional constraints of mesh connectivity and shape categories. The entire process consists of two stages. In the first stage, we employ the functional map paradigm to extract point-to-point (p2p) maps between a collection of shapes in an unsupervised manner. These p2p maps are then utilized to construct a common latent space, which ensures straightforward interpretation and independence from mesh connectivity and shape category. Through extensive experiments, we demonstrate that our method achieves excellent reconstructions and produces more realistic and smoother interpolations than baseline approaches.
CVDec 22, 2025
Unified Brain Surface and Volume RegistrationS. Mazdak Abulnaga, Andrew Hoopes, Malte Hoffmann et al.
Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, NeurAlign, that registers $3$D brain MRI images by jointly aligning both cortical and subcortical regions through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods -- improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. With its superior accuracy, fast inference, and ease of use, NeurAlign sets a new standard for joint cortical and subcortical registration.
CVDec 3, 2025
Unique Lives, Shared World: Learning from Single-Life VideosTengda Han, Sayna Ebrahimi, Dilara Gokay et al.
We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world, both leads to consistency in models trained on individual lives, and provides a powerful signal for visual representation learning.
LGFeb 17
Beyond ReLU: Bifurcation, Oversmoothing, and Topological PriorsErkan Turan, Gaspard Abel, Maysam Behmanesh et al.
Graph Neural Networks (GNNs) learn node representations through iterative network-based message-passing. While powerful, deep GNNs suffer from oversmoothing, where node features converge to a homogeneous, non-informative state. We re-frame this problem of representational collapse from a \emph{bifurcation theory} perspective, characterizing oversmoothing as convergence to a stable ``homogeneous fixed point.'' Our central contribution is the theoretical discovery that this undesired stability can be broken by replacing standard monotone activations (e.g., ReLU) with a class of functions. Using Lyapunov-Schmidt reduction, we analytically prove that this substitution induces a bifurcation that destabilizes the homogeneous state and creates a new pair of stable, non-homogeneous \emph{patterns} that provably resist oversmoothing. Our theory predicts a precise, nontrivial scaling law for the amplitude of these emergent patterns, which we quantitatively validate in experiments. Finally, we demonstrate the practical utility of our theory by deriving a closed-form, bifurcation-aware initialization and showing its utility in real benchmark experiments.
CVNov 4, 2025
Dynamic Reflections: Probing Video Representations with Text AlignmentTyler Zhu, Tengda Han, Leonidas Guibas et al.
The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at https://video-prh.github.io/
CVMar 11, 2024Code
Shape Non-rigid Kinematics (SNK): A Zero-Shot Method for Non-Rigid Shape Matching via Unsupervised Functional Map Regularized ReconstructionSouhaib Attaiki, Maks Ovsjanikov
We present Shape Non-rigid Kinematics (SNK), a novel zero-shot method for non-rigid shape matching that eliminates the need for extensive training or ground truth data. SNK operates on a single pair of shapes, and employs a reconstruction-based strategy using an encoder-decoder architecture, which deforms the source shape to closely match the target shape. During the process, an unsupervised functional map is predicted and converted into a point-to-point map, serving as a supervisory mechanism for the reconstruction. To aid in training, we have designed a new decoder architecture that generates smooth, realistic deformations. SNK demonstrates competitive results on traditional benchmarks, simplifying the shape-matching process without compromising accuracy. Our code can be found online: https://github.com/pvnieo/SNK
CVJul 4, 2025Code
SciVid: Cross-Domain Evaluation of Video Models in Scientific ApplicationsYana Hasson, Pauline Luc, Liliane Momeni et al.
In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.
CVMar 7, 2025Code
Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent SpacesSouhail Hadgi, Luca Moschella, Andrea Santilli et al.
Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data compared to other representations. Our code and weights are available at https://github.com/Souhail-01/3d-text-alignment
CVJul 31, 2025Code
DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape MatchingEmery Pierson, Lei Li, Angela Dai et al.
Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
CVMar 27, 2023Code
Generalizable Local Feature Pre-training for Deformable Shape AnalysisSouhaib Attaiki, Lei Li, Maks Ovsjanikov
Transfer learning is fundamental for addressing problems in settings with little training data. While several transfer learning approaches have been proposed in 3D, unfortunately, these solutions typically operate on an entire 3D object or even scene-level and thus, as we show, fail to generalize to new classes, such as deformable organic shapes. In addition, there is currently a lack of understanding of what makes pre-trained features transferable across significantly different 3D shape categories. In this paper, we make a step toward addressing these challenges. First, we analyze the link between feature locality and transferability in tasks involving deformable 3D objects, while also comparing different backbones and losses for local feature pre-training. We observe that with proper training, learned features can be useful in such tasks, but, crucially, only with an appropriate choice of the receptive field size. We then propose a differentiable method for optimizing the receptive field within 3D transfer learning. Jointly, this leads to the first learnable features that can successfully generalize to unseen classes of 3D shapes such as humans and animals. Our extensive experiments show that this approach leads to state-of-the-art results on several downstream tasks such as segmentation, shape correspondence, and classification. Our code is available at \url{https://github.com/pvnieo/vader}.
CVJan 14, 2023Code
NCP: Neural Correspondence Prior for Effective Unsupervised Shape MatchingSouhaib Attaiki, Maks Ovsjanikov
We present Neural Correspondence Prior (NCP), a new paradigm for computing correspondences between 3D shapes. Our approach is fully unsupervised and can lead to high-quality correspondences even in challenging cases such as sparse point clouds or non-isometric meshes, where current methods fail. Our first key observation is that, in line with neural priors observed in other domains, recent network architectures on 3D data, even without training, tend to produce pointwise features that induce plausible maps between rigid or non-rigid shapes. Secondly, we show that given a noisy map as input, training a feature extraction network with the input map as supervision tends to remove artifacts from the input and can act as a powerful correspondence denoising mechanism, both between individual pairs and within a collection. With these observations in hand, we propose a two-stage unsupervised paradigm for shape matching by (i) performing unsupervised training by adapting an existing approach to obtain an initial set of noisy matches, and (ii) using these matches to train a network in a supervised manner. We demonstrate that this approach significantly improves the accuracy of the maps, especially when trained within a collection. We show that NCP is data-efficient, fast, and achieves state-of-the-art results on many tasks. Our code can be found online: https://github.com/pvnieo/NCP.
CVOct 19, 2021Code
DPFM: Deep Partial Functional MapsSouhaib Attaiki, Gautam Pai, Maks Ovsjanikov
We consider the problem of computing dense correspondences between non-rigid shapes with potentially significant partiality. Existing formulations tackle this problem through heavy manifold optimization in the spectral domain, given hand-crafted shape descriptors. In this paper, we propose the first learning method aimed directly at partial non-rigid shape correspondence. Our approach uses the functional map framework, can be trained in a supervised or unsupervised manner, and learns descriptors directly from the data, thus both improving robustness and accuracy in challenging cases. Furthermore, unlike existing techniques, our method is also applicable to partial-to-partial non-rigid matching, in which the common regions on both shapes are unknown a priori. We demonstrate that the resulting method is data-efficient, and achieves state-of-the-art results on several benchmark datasets. Our code and data can be found online: https://github.com/pvnieo/DPFM
CVSep 22, 2021Code
Differentiable Surface TriangulationMarie-Julie Rakotosaona, Noam Aigerman, Niloy Mitra et al.
Triangle meshes remain the most popular data representation for surface geometry. This ubiquitous representation is essentially a hybrid one that decouples continuous vertex locations from the discrete topological triangulation. Unfortunately, the combinatorial nature of the triangulation prevents taking derivatives over the space of possible meshings of any given surface. As a result, to date, mesh processing and optimization techniques have been unable to truly take advantage of modular gradient descent components of modern optimization frameworks. In this work, we present a differentiable surface triangulation that enables optimization for any per-vertex or per-face differentiable objective function over the space of underlying surface triangulations. Our method builds on the result that any 2D triangulation can be achieved by a suitably perturbed weighted Delaunay triangulation. We translate this result into a computational algorithm by proposing a soft relaxation of the classical weighted Delaunay triangulation and optimizing over vertex weights and vertex locations. We extend the algorithm to 3D by decomposing shapes into developable sets and differentiably meshing each set with suitable boundary constraints. We demonstrate the efficacy of our method on various planar and surface meshes on a range of difficult-to-optimize objective functions. Our code can be found online: https://github.com/mrakotosaon/diff-surface-triangulation.
CVDec 2, 2020Code
Learning Delaunay Surface Elements for Mesh ReconstructionMarie-Julie Rakotosaona, Paul Guerrero, Noam Aigerman et al.
We present a method for reconstructing triangle meshes from point clouds. Existing learning-based methods for mesh reconstruction mostly generate triangles individually, making it hard to create manifold meshes. We leverage the properties of 2D Delaunay triangulations to construct a mesh from manifold surface elements. Our method first estimates local geodesic neighborhoods around each point. We then perform a 2D projection of these neighborhoods using a learned logarithmic map. A Delaunay triangulation in this 2D domain is guaranteed to produce a manifold patch, which we call a Delaunay surface element. We synchronize the local 2D projections of neighboring elements to maximize the manifoldness of the reconstructed mesh. Our results show that we achieve better overall manifoldness of our reconstructed meshes than current methods to reconstruct meshes with arbitrary topology. Our code, data and pretrained models can be found online: https://github.com/mrakotosaon/dse-meshing
LGSep 29, 2020Code
Geometric Matrix Completion: A Functional ViewAbhishek Sharma, Maks Ovsjanikov
We propose a totally functional view of geometric matrix completion problem. Differently from existing work, we propose a novel regularization inspired from the functional map literature that is more interpretable and theoretically sound. On synthetic tasks with strong underlying geometric structure, our framework outperforms state of the art by a huge margin (two order of magnitude) demonstrating the potential of our approach. On real datasets, we achieve state-of-the-art results at a fraction of the computational effort of previous methods. Our code is publicly available at https://github.com/Not-IITian/functional-matrix-completion
CVSep 28, 2020Code
Weakly Supervised Deep Functional Map for Shape MatchingAbhishek Sharma, Maks Ovsjanikov
A variety of deep functional maps have been proposed recently, from fully supervised to totally unsupervised, with a range of loss functions as well as different regularization terms. However, it is still not clear what are minimum ingredients of a deep functional map pipeline and whether such ingredients unify or generalize all recent work on deep functional maps. We show empirically minimum components for obtaining state of the art results with different loss functions, supervised as well as unsupervised. Furthermore, we propose a novel framework designed for both full-to-full as well as partial to full shape matching that achieves state of the art results on several benchmark datasets outperforming even the fully supervised methods by a significant margin. Our code is publicly available at https://github.com/Not-IITian/Weakly-supervised-Functional-map
MLMar 31, 2020Code
Deep Geometric Functional Maps: Robust Feature Learning for Shape CorrespondenceNicolas Donati, Abhishek Sharma, Maks Ovsjanikov
We present a novel learning-based approach for computing correspondences between non-rigid 3D shapes. Unlike previous methods that either require extensive training data or operate on handcrafted input descriptors and thus generalize poorly across diverse datasets, our approach is both accurate and robust to changes in shape structure. Key to our method is a feature-extraction network that learns directly from raw shape geometry, combined with a novel regularized map extraction layer and loss, based on the functional map representation. We demonstrate through extensive experiments in challenging shape matching scenarios that our method can learn from less training data than existing supervised approaches and generalizes significantly better than current descriptor-based learning methods. Our source code is available at: https://github.com/LIX-shape-analysis/GeomFmaps.
CVMar 30, 2024
Memory-Scalable and Simplified Functional Map LearningRobin Magnet, Maks Ovsjanikov
Deep functional maps have emerged in recent years as a prominent learning-based framework for non-rigid shape matching problems. While early methods in this domain only focused on learning in the functional domain, the latest techniques have demonstrated that by promoting consistency between functional and pointwise maps leads to significant improvements in accuracy. Unfortunately, existing approaches rely heavily on the computation of large dense matrices arising from soft pointwise maps, which compromises their efficiency and scalability. To address this limitation, we introduce a novel memory-scalable and efficient functional map learning pipeline. By leveraging the specific structure of functional maps, we offer the possibility to achieve identical results without ever storing the pointwise map in memory. Furthermore, based on the same approach, we present a differentiable map refinement layer adapted from an existing axiomatic refinement algorithm. Unlike many functional map learning methods, which use this algorithm at a post-processing step, ours can be easily used at train time, enabling to enforce consistency between the refined and initial versions of the map. Our resulting approach is both simpler, more efficient and more numerically stable, by avoiding differentiation through a linear system, while achieving close to state-of-the-art results in challenging scenarios.
CVJun 30, 2025
MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface ReconstructionAntoine Guédon, Diego Gomez, Nissim Maruani et al.
While recent advances in Gaussian Splatting have enabled fast reconstruction of high-quality 3D scenes from images, extracting accurate surface meshes remains a challenge. Current approaches extract the surface through costly post-processing steps, resulting in the loss of fine geometric details or requiring significant time and leading to very dense meshes with millions of vertices. More fundamentally, the a posteriori conversion from a volumetric to a surface representation limits the ability of the final mesh to preserve all geometric structures captured during training. We present MILo, a novel Gaussian Splatting framework that bridges the gap between volumetric and surface representations by differentiably extracting a mesh from the 3D Gaussians. We design a fully differentiable procedure that constructs the mesh-including both vertex locations and connectivity-at every iteration directly from the parameters of the Gaussians, which are the only quantities optimized during training. Our method introduces three key technical contributions: a bidirectional consistency framework ensuring both representations-Gaussians and the extracted mesh-capture the same underlying geometry during training; an adaptive mesh extraction process performed at each training iteration, which uses Gaussians as differentiable pivots for Delaunay triangulation; a novel method for computing signed distance values from the 3D Gaussians that enables precise surface extraction while avoiding geometric erosion. Our approach can reconstruct complete scenes, including backgrounds, with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. Due to their light weight and empty interior, our meshes are well suited for downstream applications such as physics simulations or animation.
CVDec 9, 2024
ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language ModelsBingchen Gong, Diego Gomez, Abdullah Hamdi et al.
We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.
57.0CVApr 8
From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented GaussiansDiego Gomez, Antoine Guédon, Nissim Maruani et al.
3D Gaussian Splatting (3DGS) has revolutionized fast novel view synthesis, yet its opacity-based formulation makes surface extraction fundamentally difficult. Unlike implicit methods built on Signed Distance Fields or occupancy, 3DGS lacks a global geometric field, forcing existing approaches to resort to heuristics such as TSDF fusion of blended depth maps. Inspired by the Objects as Volumes framework, we derive a principled occupancy field for Gaussian Splatting and show how it can be used to extract highly accurate watertight meshes of complex scenes. Our key contribution is to introduce a learnable oriented normal at each Gaussian element and to define an adapted attenuation formulation, which leads to closed-form expressions for both the normal and occupancy fields at arbitrary locations in space. We further introduce a novel consistency loss and a dedicated densification strategy to enforce Gaussians to wrap the entire surface by closing geometric holes, ensuring a complete shell of oriented primitives. We modify the differentiable rasterizer to output depth as an isosurface of our continuous model, and introduce Primal Adaptive Meshing for Region-of-Interest meshing at arbitrary resolution. We additionally expose fundamental biases in standard surface evaluation protocols and propose two more rigorous alternatives. Overall, our method Gaussian Wrapping sets a new state-of-the-art on DTU and Tanks and Temples, producing complete, watertight meshes at a fraction of the size of concurrent work-recovering thin structures such as the notoriously elusive bicycle spokes.
LGJun 27, 2025
Unfolding Generative Flows with Koopman Operators: Fast and Interpretable SamplingErkan Turan, Aristotelis Siozopoulos, Louis Martinez et al.
Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by slow sampling: producing a single sample requires solving a nonlinear ODE with hundreds of function evaluations. Recent approaches such as Rectified Flow and OT-CFM accelerate sampling by straightening trajectories, yet the learned dynamics remain nonlinear black boxes, limiting both efficiency and interpretability. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory. By lifting Conditional Flow Matching (CFM) into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. This yields two key benefits. First, sampling becomes one-step and parallelizable, computed in closed form via the matrix exponential. Second, the Koopman operator provides a spectral blueprint of generation, enabling novel interpretability through its eigenvalues and modes. We derive a practical, simulation-free training objective that enforces infinitesimal consistency with the teacher's dynamics and show that this alignment preserves fidelity along the full generative path, distinguishing our method from boundary-only distillation. Empirically, our approach achieves competitive sample quality with dramatic speedups, while uniquely enabling spectral analysis of generative flows.
CVFeb 3, 2025
FourieRF: Few-Shot NeRFs via Progressive Fourier Frequency ControlDiego Gomez, Bingchen Gong, Maks Ovsjanikov
In this work, we introduce FourieRF, a novel approach for achieving fast and high-quality reconstruction in the few-shot setting. Our method effectively parameterizes features through an explicit curriculum training procedure, incrementally increasing scene complexity during optimization. Experimental results show that the prior induced by our approach is both robust and adaptable across a wide variety of scenes, establishing FourieRF as a strong and versatile baseline for the few-shot rendering problem. While our approach significantly reduces artifacts, it may still lead to reconstruction errors in severely under-constrained scenarios, particularly where view occlusion leaves parts of the shape uncovered. In the future, our method could be enhanced by integrating foundation models to complete missing parts using large data-driven priors.
CVDec 21, 2024
GANFusion: Feed-Forward Text-to-3D with Diffusion in GAN SpaceSouhaib Attaiki, Paul Guerrero, Duygu Ceylan et al.
We train a feed-forward text-to-3D diffusion generator for human characters using only single-view 2D data for supervision. Existing 3D generative models cannot yet match the fidelity of image or video generative models. State-of-the-art 3D generators are either trained with explicit 3D supervision and are thus limited by the volume and diversity of existing 3D data. Meanwhile, generators that can be trained with only 2D data as supervision typically produce coarser results, cannot be text-conditioned, or must revert to test-time optimization. We observe that GAN- and diffusion-based generators have complementary qualities: GANs can be trained efficiently with 2D supervision to produce high-quality 3D objects but are hard to condition on text. In contrast, denoising diffusion models can be conditioned efficiently but tend to be hard to train with only 2D supervision. We introduce GANFusion, which starts by generating unconditional triplane features for 3D data using a GAN architecture trained with only single-view 2D data. We then generate random samples from the GAN, caption them, and train a text-conditioned diffusion model that directly learns to sample from the space of good triplane features that can be decoded into 3D objects.
CVDec 3, 2024
PaNDaS: Learnable Deformation Modeling with Localized ControlThomas Besnier, Emery Pierson, Sylvain Arguillere et al.
Non-rigid shape deformations pose significant challenges, and most existing methods struggle to handle partial deformations effectively. We propose to learn deformations at the point level, which allows for localized control of 3D surface meshes, enabling Partial Non-rigid Deformations and interpolations of Surfaces (PaNDaS). Unlike previous approaches, our method can restrict the deformations to specific parts of the shape in a versatile way. Moreover, one can mix and combine various poses from the database, all while not requiring any optimization at inference time. We demonstrate state-of-the-art accuracy and greater locality for shape reconstruction and interpolation compared to approaches relying on global shape representation across various types of human surface data. We also demonstrate several localized shape manipulation tasks and show that our method can generate new shapes by combining different input deformations. Code and data will be made available after the reviewing process.
86.9CVMar 31
SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D ScenesLéopold Maillard, Francis Engelmann, Tom Durand et al.
Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.
CVJan 5
PatchAlign3D: Local Feature Alignment for Dense 3D Shape understandingSouhail Hadgi, Bingchen Gong, Ramana Sundararaman et al.
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks. Project website: https://souhail-hadgi.github.io/patchalign3dsite/
LGSep 11, 2025
Graph Alignment via Dual-Pass Spectral Encoding and Latent Space CommunicationMaysam Behmanesh, Erkan Turan, Maks Ovsjanikov
Graph alignment, the problem of identifying corresponding nodes across multiple graphs, is fundamental to numerous applications. Most existing unsupervised methods embed node features into latent representations to enable cross-graph comparison without ground-truth correspondences. However, these methods suffer from two critical limitations: the degradation of node distinctiveness due to oversmoothing in GNN-based embeddings, and the misalignment of latent spaces across graphs caused by structural noise, feature heterogeneity, and training instability, ultimately leading to unreliable node correspondences. We propose a novel graph alignment framework that simultaneously enhances node distinctiveness and enforces geometric consistency across latent spaces. Our approach introduces a dual-pass encoder that combines low-pass and high-pass spectral filters to generate embeddings that are both structure-aware and highly discriminative. To address latent space misalignment, we incorporate a geometry-aware functional map module that learns bijective and isometric transformations between graph embeddings, ensuring consistent geometric relationships across different representations. Extensive experiments on graph benchmarks demonstrate that our method consistently outperforms existing unsupervised alignment baselines, exhibiting superior robustness to structural inconsistencies and challenging alignment scenarios. Additionally, comprehensive evaluation on vision-language benchmarks using diverse pretrained models shows that our framework effectively generalizes beyond graph domains, enabling unsupervised alignment of vision and language representations.
CVJul 18, 2025
Generalist Forecasting with Frozen Video Models via Latent DiffusionJacob C Walker, Pedro Vélez, Luisa Polania Cabrera et al.
Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models-including those trained generatively-and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.
CVJul 4, 2025
LACONIC: A 3D Layout Adapter for Controllable Image CreationLéopold Maillard, Tom Durand, Adrien Ramanana Rahary et al.
Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect consistent three-dimensional geometric structure, underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable number of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.