RapVerse: Coherent Vocals And Whole-body Motions Generations From Text
2024 · Jiaben Chen, Xin Yan, Yihang Chen, et al.
Abstract
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information and singer identity. By jointly performing transformer modeling on these three modalities in
Related papers
- Drop The Beat! Freestyler For Accompaniment Conditioned Rapping Voice Generation (2024)
- Vevo2: A Unified And Controllable Framework For Speech And Singing Voice Generation (2025)
- Recom: Realistic Co-speech Motion Generation With Recurrent Embedded Transformer (2025)
- Talkverse: Democratizing Minute-long Audio-driven Video Generation (2025)
- Songgen: A Single Stage Auto-regressive Transformer For Text-to-song Generation (2025)
- Cssinger: End-to-end Chunkwise Streaming Singing Voice Synthesis System Based On Conditional Variational Autoencoder (2024)
- Real-time And Accurate: Zero-shot High-fidelity Singing Voice Conversion With Multi-condition Flow Synthesis (2024)
- Mechanisms Of Multimodal Synchronization: Insights From Decoder-based Video-text-to-speech Synthesis (2024)