CV · Jun 19, 2023

Renderers are Good Zero-Shot Representation Learners: Exploring Diffusion Latents for Metric Learning

arXiv:2306.10721v1 · 1.5 · h-index: 1 · Has Code

Originality Incremental advance

AI Analysis

This work addresses the problem of improving metric learning for 3D visual understanding, offering incremental insights into using generative models for discriminative tasks.

The paper investigates whether latent spaces from generative neural rendering models like Shap-E can serve as representations for 3D-aware discriminative tasks, finding that Shap-E outperforms EfficientNet zero-shot and remains competitive when both are trained with a contrastive loss.

Can the latent spaces of modern generative neural rendering models serve as representations for 3D-aware discriminative visual understanding tasks? We use retrieval as a proxy for measuring the metric learning properties of the latent spaces of Shap-E, including capturing view-independence and enabling the aggregation of scene representations from the representations of individual image views. We find that Shap-E representations outperform those of the classical EfficientNet baseline zero-shot, and remain competitive when both methods are trained using a contrastive loss. These findings give a preliminary indication that 3D-based rendering and generative models can yield useful representations for discriminative tasks in our innately 3D-native world. Our code is available at https://github.com/michaelwilliamtang/golden-retriever.
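To make the retrieval-as-proxy idea concrete, here is a minimal sketch of how per-view latents might be aggregated into a scene representation and scored by nearest-neighbor retrieval. It is not taken from the paper or its repository: the mean-pooling aggregation, the cosine similarity, the recall@1 metric, the function names, and the synthetic data are all assumptions chosen for illustration. Real latents would come from an encoder such as Shap-E or EfficientNet.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def aggregate_views(view_latents):
    """Pool per-view latents of shape (scenes, views, dim) into (scenes, dim).

    Mean pooling is an assumption made for this sketch, not necessarily
    the aggregation used in the paper.
    """
    pooled = l2_normalize(view_latents).mean(axis=1)
    return l2_normalize(pooled)


def recall_at_1(queries, gallery, query_labels, gallery_labels):
    """Fraction of queries whose most similar gallery item shares their label."""
    sims = l2_normalize(queries) @ l2_normalize(gallery).T  # cosine similarity
    nearest = sims.argmax(axis=1)
    return float((gallery_labels[nearest] == query_labels).mean())


# Synthetic stand-in for encoder outputs: each scene has a center latent,
# and each view of that scene is a noisy copy of it.
rng = np.random.default_rng(0)
num_scenes, num_views, dim = 5, 4, 16
centers = rng.normal(size=(num_scenes, dim))
noise = 0.05 * rng.normal(size=(num_scenes, num_views, dim))
views = centers[:, None, :] + noise

# Hold out the last view of each scene as the query; pool the rest.
gallery = aggregate_views(views[:, :-1])
queries = views[:, -1]
labels = np.arange(num_scenes)

score = recall_at_1(queries, gallery, labels, labels)
print(f"recall@1 on synthetic latents: {score:.2f}")
```

A latent space with good view-independence should place a held-out view close to the pooled representation of the other views of the same scene, which is what this recall@1 score measures.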
