Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset
2024 · Neil Shah, Shirish Karande, Vineet Gandhi
Abstract
Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and t