MonSTeR: A Unified Model For Motion, Scene, Text Retrieval
2025 · Luca Collorone, Matteo Gioia, Massimiliano Pappa, et al.
Abstract
Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-sh
Related papers
- Text-to-motion Retrieval: Towards Joint Understanding Of Human Motion Data And Natural Language (2023) · 11.94
- Tri-modal Motion Retrieval By Learning A Joint Embedding Space (2024) · 7.81
- StacMR: Scene-text Aware Cross-modal Retrieval (2020) · 10.48
- MSTAR: Box-free Multi-query Scene Text Retrieval With Attention Recycling (2025) · 2.00
- Mocha: Denoising Caption Supervision For Motion-text Retrieval (2026) · 0.00
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026) · 0.00
- MuMUR: Multilingual Multimodal Universal Retrieval (2022) · 2.26
- Hybrid, Unified And Iterative: A Novel Framework For Text-based Person Anomaly Retrieval (2025) · 0.00