Renderers Are Good Zero-shot Representation Learners: Exploring Diffusion Latents For Metric Learning
2023 · Michael Tang, David Shustin
Abstract
Can the latent spaces of modern generative neural rendering models serve as representations for 3D-aware discriminative visual understanding tasks? We use retrieval as a proxy for measuring the metric learning properties of the latent spaces of Shap-E, including whether they capture view-independence and whether scene representations can be aggregated from the representations of individual image views. We find that Shap-E representations outperform those of a classical EfficientNet baseline zero-shot, and remain competitive when both methods are trained using a contrastive loss. These findings give a preliminary indication that 3D-based rendering and generative models can yield useful representations for discriminative tasks in our innately 3D-native world. Our code is available at https://github.com/michaelwilliamtang/golden-retriever.
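The retrieval-as-proxy idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the vectors below are synthetic stand-ins for per-view latents (no Shap-E or EfficientNet model is loaded), mean pooling is only one assumed way to aggregate views into a scene representation, and recall@k with cosine similarity is an assumed choice of retrieval metric. The function names `aggregate_views` and `recall_at_k` are hypothetical.

```python
import numpy as np


def aggregate_views(view_embeddings):
    """Mean-pool per-view embeddings of shape (n_views, dim) into one unit-norm vector.

    Mean pooling is an assumption made for illustration only.
    """
    scene = np.asarray(view_embeddings, dtype=float).mean(axis=0)
    return scene / np.linalg.norm(scene)


def recall_at_k(queries, gallery, query_labels, gallery_labels, k=1):
    """Fraction of queries whose k nearest gallery items (cosine) share their label."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                # (n_queries, n_gallery) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]      # indices of the k most similar gallery items
    hits = np.asarray(gallery_labels)[top_k] == np.asarray(query_labels)[:, None]
    return float(hits.any(axis=1).mean())


# Synthetic stand-in data: 3 scenes, each with 5 noisy "views" of a scene-specific vector.
rng = np.random.default_rng(0)
centers = np.eye(3, 8)                            # one orthogonal direction per scene
views = centers[:, None, :] + 0.01 * rng.standard_normal((3, 5, 8))

# Gallery: one aggregated embedding per scene, built from the first 4 views.
gallery = np.stack([aggregate_views(v[:4]) for v in views])
# Queries: the held-out 5th view of each scene.
queries = views[:, 4, :]

labels = np.arange(3)
score = recall_at_k(queries, gallery, labels, labels, k=1)
print(f"recall@1 on synthetic data: {score:.2f}")
```

In a real evaluation, the synthetic `views` array would be replaced by latents extracted from rendered views of each object, and a view-independent representation would show up as high recall when the query view was not among those aggregated into the gallery entry.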