Mechanisms Of Multimodal Synchronization: Insights From Decoder-based Video-text-to-speech Synthesis
2024 · Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, et al.
Abstract
Unified decoder-only transformers have shown promise for multimodal generation, yet the mechanisms by which they synchronize modalities with heterogeneous sampling rates remain underexplored. We investigate these mechanisms through video-text-to-speech (VTTS) synthesis, a controlled task requiring fine-grained temporal alignment between sparse text, video, and continuous speech. Using a unified decoder-only transformer, dubbed Visatronic, trained on VoxCeleb2, we study: (i) how modalities contribute complementary information, (ii) how positional encoding strategies enable synchronization across heterogeneous rates, (iii) how modality ordering shapes the trade-off between in-domain performance and cross-domain transfer, and (iv) how phoneme-level synchronization metrics provide diagnostic insight into per-phoneme timing errors. Our findings reveal that both "global sequential indexing" (unique position IDs across modalities) and "co-temporal ordered indexing" (identical IDs for temporally
Related papers
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025) 0.00
- 3MDiT: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025) 0.00
- More Than Words: In-the-wild Visually-driven Prosody For Text-to-speech (2021) 9.03
- VCVTS: Multi-speaker Video-to-speech Synthesis Via Cross-modal Knowledge Transfer From Voice Conversion (2022) 6.77
- SyncVSR: Data-efficient Visual Speech Recognition With End-to-end Crossmodal Audio Token Synchronization (2024) 8.35
- Audio-sync Video Generation With Multi-stream Temporal Control (2025) 0.00
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025) 0.00
- VisualTTS: TTS With Accurate Lip-speech Synchronization For Automatic Voice Over (2021) 9.41