Multi-source Spatial Knowledge Understanding For Immersive Visual Text-to-speech
2024 · Shuwei He, Rui Liu
Abstract
Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize reverberant speech for the spoken content. Previous works focus on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge such as depth, speaker position, and environmental semantics. To address this issue, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS2KU-VTTS. Specifically, we first prioritize the RGB image as the dominant source and consider the depth image, speaker position knowledge from object detection, and Gemini-generated semantic captions as supplementary sources. Afterwards, we propose a serial interaction mechanism to effectively integrate both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on the respective contributions of each source. This enriched interaction and integration of multi-source spatial knowledge guides the sp
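The contribution-based integration mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function `fuse_sources`, the scaled dot-product scoring against the RGB feature, and the toy feature vectors are all assumptions made for illustration. The sketch only shows one plausible way to weight a dominant source and several supplementary sources by a softmax over per-source scores.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_sources(rgb, supplementary):
    """Hypothetical fusion: weight each source by its similarity to the RGB feature.

    rgb: list of floats, the dominant-source feature (assumed already encoded).
    supplementary: dict mapping a source name to a feature of the same length.
    Returns the fused feature and the per-source weights.
    """
    names = ["rgb"] + sorted(supplementary)
    feats = [rgb] + [supplementary[n] for n in names[1:]]
    scale = math.sqrt(len(rgb))
    # Assumed scoring rule: scaled dot product with the dominant RGB feature.
    weights = softmax([dot(f, rgb) / scale for f in feats])
    fused = [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(len(rgb))]
    return fused, dict(zip(names, weights))

# Toy features standing in for encoder outputs (made-up numbers).
rgb = [0.5, -1.0, 0.25, 2.0]
supplementary = {
    "depth": [0.1, -0.8, 0.0, 1.5],
    "position": [1.0, 0.2, -0.3, 0.0],
    "caption": [-0.4, 0.6, 0.9, 0.3],
}
fused, weights = fuse_sources(rgb, supplementary)
print({name: round(w, 3) for name, w in weights.items()})
```

In this toy setup the weights sum to one, and sources whose features align more closely with the RGB feature receive larger weights. A learned gating network would normally replace the fixed dot-product score.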
Related papers
- Multi-modal And Multi-scale Spatial Environment Understanding For Immersive Visual Text-to-speech (2024)
- I2TTS: Image-indicated Immersive Text-to-speech Synthesis With Spatial Perception (2024)
- VCVTS: Multi-speaker Video-to-speech Synthesis Via Cross-modal Knowledge Transfer From Voice Conversion (2022)
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024)
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025)
- DiffV2S: Diffusion-based Video-to-speech Synthesis With Vision-guided Speaker Embedding (2023)
- Connecting The Dots Between Audio And Text Without Parallel Data Through Visual Knowledge Transfer (2021)
- ViT-TTS: Visual Text-to-speech With Scalable Diffusion Transformer (2023)