Replacing Human Audio With Synthetic Audio For On-device Unspoken Punctuation Prediction
2020 · Daria Soboleva, Ondrej Skopek, Márius Šajgalík, et al.
Abstract
We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features. We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.
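To make the input construction concrete, here is a minimal sketch of one common way to build hash-based token embeddings and pair them with acoustic features. It is an illustration of the generic hashing trick, not the authors' implementation: the abstract does not specify the hashing scheme, the embedding size, or the acoustic feature set, so `NUM_BUCKETS`, `EMBED_DIM`, `bucket_of`, `make_table`, `build_inputs`, and the two example features (pause length and relative pitch) are all assumptions. The quasi-recurrent network that would consume these vectors is not shown.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # hypothetical table size, fixed regardless of vocabulary
EMBED_DIM = 8       # hypothetical embedding width


def bucket_of(token: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a token to a bucket using a stable hash, so no vocabulary file is stored."""
    digest = hashlib.md5(token.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % num_buckets


def make_table(num_buckets: int = NUM_BUCKETS, dim: int = EMBED_DIM, seed: int = 0):
    """Create a randomly initialised embedding table (it would be trained in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_buckets)]


def build_inputs(tokens, acoustic, table):
    """Concatenate each token's hashed embedding with its acoustic feature vector."""
    if len(tokens) != len(acoustic):
        raise ValueError("expected one acoustic feature vector per token")
    return [
        table[bucket_of(tok, len(table))] + list(feats)
        for tok, feats in zip(tokens, acoustic)
    ]


TABLE = make_table()
tokens = ["how", "are", "you"]  # stand-in for ASR text output
# Assumed per-token acoustic features: (pause length in seconds, relative pitch)
acoustic = [(0.05, 0.1), (0.02, -0.2), (0.40, 0.6)]
inputs = build_inputs(tokens, acoustic, TABLE)
```

Because tokens are hashed into a fixed number of buckets, the table size in this sketch does not grow with the vocabulary, which is the property that makes this family of embeddings attractive for small on-device models. The cost is that distinct tokens can collide in the same bucket.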
Related papers
- Multimodal Semi-supervised Learning Framework For Punctuation Prediction In Conversational Speech (2020)
- Improved Training For End-to-end Streaming Automatic Speech Recognition Model With Punctuation (2023)
- Towards Unsupervised Speech Recognition Without Pronunciation Models (2024)
- Unified Multimodal Punctuation Restoration Framework For Mixed-modality Corpus (2022)
- Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context For Continuous Speech Recognition (2023)
- Natural Language Guidance Of High-fidelity Text-to-speech With Synthetic Annotations (2024)
- Alternate Endings: Improving Prosody For Incremental Neural TTS With Predicted Future Text Input (2021)
- Utilizing Neural Transducers For Two-stage Text-to-speech Via Semantic Token Prediction (2024)