Robert W. Sumner

CV
h-index45
4papers
389citations
Novelty55%
AI Score42

4 Papers

CVJun 23, 2023
OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Ayça Takmaz, Elisabetta Fedele, Robert W. Sumner et al.

We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.

CVSep 27, 2024
Search3D: Hierarchical Open-Vocabulary 3D Segmentation

Ayca Takmaz, Alexandros Delitzas, Robert W. Sumner et al.

Open-vocabulary 3D segmentation enables exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances but struggle with finer-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach to construct hierarchical open-vocabulary 3D scene representations, enabling 3D search at multiple levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Unlike prior methods, Search3D shifts towards a more flexible open-vocabulary 3D search paradigm, moving beyond explicit object-centric queries. For systematic evaluation, we further contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. Search3D outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials. Our project page is http://search3d-segmentation.github.io.

CVFeb 2
VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations

Fatemeh Zargarbashi, Dhruv Agrawal, Jakob Buhmann et al.

Human motion data is inherently rich and complex, containing both semantic content and subtle stylistic features that are challenging to model. We propose a novel method for effective disentanglement of the style and content in human motion data to facilitate style transfer. Our approach is guided by the insight that content corresponds to coarse motion attributes while style captures the finer, expressive details. To model this hierarchy, we employ Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to learn a coarse-to-fine representation of motion. We further enhance the disentanglement by integrating contrastive learning and a novel information leakage loss with codebook learning to organize the content and the style across different codebooks. We harness this disentangled representation using our simple and effective inference-time technique Quantized Code Swapping, which enables motion style transfer without requiring any fine-tuning for unseen styles. Our framework demonstrates strong versatility across multiple inference applications, including style transfer, style removal, and motion blending.

CYFeb 15, 2022
The Hitchhiker's Guide to Fused Twins: A Review of Access to Digital Twins in situ in Smart Cities

Jascha Grübel, Tyler Thrash, Leonel Aguilar et al.

Smart Cities already surround us, and yet they are still incomprehensibly far from directly impacting everyday life. While current Smart Cities are often inaccessible, the experience of everyday citizens may be enhanced with a combination of the emerging technologies Digital Twins (DTs) and Situated Analytics. DTs represent their Physical Twin (PT) in the real world via models, simulations, (remotely) sensed data, context awareness, and interactions. However, interaction requires appropriate interfaces to address the complexity of the city. Ultimately, leveraging the potential of Smart Cities requires going beyond assembling the DT to be comprehensive and accessible. Situated Analytics allows for the anchoring of city information in its spatial context. We advance the concept of embedding the DT into the PT through Situated Analytics to form Fused Twins (FTs). This fusion allows access to data in the location that it is generated in an embodied context that can make the data more understandable. Prototypes of FTs are rapidly emerging from different domains, but Smart Cities represent the context with the most potential for FTs in the future. This paper reviews DTs, Situated Analytics, and Smart Cities as the foundations of FTs. Regarding DTs, we define five components (Physical, Data, Analytical, Virtual, and Connection environments) that we relate to several cognates (i.e., similar but different terms) from existing literature. Regarding Situated Analytics, we review the effects of user embodiment on cognition and cognitive load. Finally, we classify existing partial examples of FTs from the literature and address their construction from Augmented Reality, Geographic Information Systems, Building/City Information Models, and DTs and provide an overview of future direction