Towards Improving NAM-to-Speech Synthesis Intelligibility Using Self-Supervised Speech Models
2024 · Neil Shah, Shirish Karande, Vineet Gandhi
Abstract
We propose a novel approach that significantly improves intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task by leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning. Unlike conventional methods that explicitly record ground-truth speech, our method relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite using simulated speech, our method improves on the current state of the art (SOTA) by 29.08% in the Mel-Cepstral Distortion (MCD) metric. Additionally, we report error rates and demonstrate our model's ability to synthesize speech in novel voices of interest. We also present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark Word Error Rate (WER) of 42.57% for gauging the intelligibility of the synthesized speech. Speech samples can be found at https://nam2speech.github.io/NAM2Speech/