Towards Unsupervised Speech Recognition Without Pronunciation Models
2024 · Junrui Ni, Liming Wang, Yang Zhang, et al.
Abstract
Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate between 20% and 23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This innovative model surpas
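The abstract reports results as a word error rate (WER). As a reminder of what that figure measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation code, and their scoring may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER between 20% and 23% corresponds to roughly one word-level error for every four to five reference words.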
Related papers
- Towards Unsupervised Automatic Speech Recognition Trained By Unaligned Speech And Text Only (2018)
- Unsupervised Automatic Speech Recognition: A Review (2021)
- Unsupervised Speech Recognition (2021)
- Unsupervised Speech Recognition Via Segmental Empirical Output Distribution Matching (2018)
- Towards Unsupervised Speech-to-text Translation (2018)
- Unsupervised Neural And Bayesian Models For Zero-resource Speech Processing (2017)
- Analyzing The Robustness Of Unsupervised Speech Recognition (2021)
- Semi-supervised Sequence-to-sequence ASR Using Unpaired Speech And Text (2019)