Text-to-motion Retrieval: Towards Joint Understanding Of Human Motion Data And Natural Language
2023 · Nicola Messina, Jan Sedmidubsky, Fabrizio Falchi, et al.
Abstract
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-a
Related papers
- Tri-modal Motion Retrieval By Learning A Joint Embedding Space (2024) 7.81
- Poseembroider: Towards A 3D, Visual, Semantic-aware Human Pose Representation (2024) 6.34
- Monster: A Unified Model For Motion, Scene, Text Retrieval (2025) 0.00
- Mocha: Denoising Caption Supervision For Motion-text Retrieval (2026) 0.00
- Multi-modal Transformer For Video Retrieval (2020) 19.47
- Lamp: Language-motion Pretraining For Motion Generation, Retrieval, And Captioning (2024) 0.00
- TVPR: Text-to-video Person Retrieval And A New Benchmark (2023) 2.26
- Deephums: Deep Human Motion Signature For 3D Skeletal Sequences (2019) 2.26