PoseEmbroider: Towards A 3D, Visual, Semantic-aware Human Pose Representation
2024 · Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, et al.
Abstract
Aligning multiple modalities in a latent space, such as images and texts, has been shown to produce powerful semantic visual representations, fueling tasks like image captioning, text-to-image generation, or image grounding. In the context of human-centric vision, although CLIP-like representations encode most standard human poses relatively well (such as standing or sitting), they lack sufficient acuity to discern detailed or uncommon ones. Indeed, while 3D human poses have often been associated with images (e.g., to perform pose estimation or pose-conditioned image generation), or more recently with text (e.g., for text-to-pose generation), they have seldom been paired with both. In this work, we combine 3D poses, pictures of people, and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities
Related papers
- Text-to-motion Retrieval: Towards Joint Understanding Of Human Motion Data And Natural Language (2023) 11.94
- View-invariant, Occlusion-robust Probabilistic Embedding For Human Pose (2020) 8.82
- V-VIPE: Variational View Invariant Pose Embedding (2024) 2.26
- Y^2Seq2Seq: Cross-modal Representation Learning For 3D Shape And Text By Joint Reconstruction And Prediction Of View And Word Sequences (2018) 12.02
- DISP6D: Disentangled Implicit Shape And Pose Learning For Scalable 6D Pose Estimation (2021) 9.03
- See Finer, See More: Implicit Modality Alignment For Text-based Person Retrieval (2022) 18.39
- Multiview-consistent Semi-supervised Learning For 3D Human Pose Estimation (2019) 13.05
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21