Codec-ASR: Training Performant Automatic Speech Recognition Systems With Discrete Speech Representations
2024 · Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, et al.
Abstract
Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis of building ASR systems with discrete codes. We investigate different methods for codec training, such as quantization schemes and time-domain vs. spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms EnCodec at a similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark, despite being smaller in size and pretrained on significantly less data.
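To make the phrase "similar bit-rate" concrete, the following is a minimal sketch of how the bit-rate of a discrete codec is commonly computed: frames per second, times codebooks per frame, times bits per codebook index. The function name and the two configurations are illustrative assumptions, not settings reported in the paper.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Return bits per second for a codec that emits one index per codebook per frame.

    Each index into a codebook of size K carries log2(K) bits.
    """
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame


# Two hypothetical configurations (illustrative values only, not from the paper).
# They differ in frame rate and codebook count but land on the same bit-rate,
# which is why codecs are compared "at similar bit-rate" rather than per token.
config_a = codec_bitrate_bps(frame_rate_hz=75.0, num_codebooks=8, codebook_size=1024)
config_b = codec_bitrate_bps(frame_rate_hz=50.0, num_codebooks=12, codebook_size=1024)

print(f"config A: {config_a / 1000:.1f} kbps")
print(f"config B: {config_b / 1000:.1f} kbps")
```

Under these assumptions both configurations come out at the same rate, so a comparison between them would isolate the effect of the codec design rather than the amount of information transmitted.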
Related papers
- Spectral Codecs: Improving Non-autoregressive Speech Synthesis With Spectrogram-based Audio Codecs (2024)
- Language-Codec: Bridging Discrete Codec Representations And Speech Language Models (2024)
- FreeCodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024)
- ESC: Efficient Speech Coding With Cross-scale Residual Vector Quantized Transformers (2024)
- RepCodec: A Speech Representation Codec For Speech Tokenization (2023)
- LSCodec: Low-bitrate And Speaker-decoupled Discrete Speech Codec (2024)
- Speech Resynthesis From Discrete Disentangled Self-supervised Representations (2021)
- MSR-Codec: A Low-bitrate Multi-stream Residual Codec For High-fidelity Speech Generation With Information Disentanglement (2025)