OWSM-CTC: An Open Encoder-only Speech Foundation Model For Speech Recognition, Translation, And Language Identification
2024 · Yifan Peng, Yui Sudo, Muhammad Shakeel, et al.
Abstract
There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-C
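To make the non-autoregressive idea concrete, here is a minimal sketch of greedy CTC decoding, the standard collapse rule for CTC outputs. It is an illustration only and not necessarily the decoding procedure used by OWSM-CTC. The function name `ctc_greedy_decode`, the blank index `0`, and the toy score matrix are all assumptions introduced for this example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Turn per-frame label scores into a label sequence with the greedy CTC rule."""
    # Pick the best label for each frame independently; no frame's choice
    # depends on previously emitted tokens, which is what makes this
    # non-autoregressive.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # Merge consecutive repeats first, then drop blanks.
    return [label for label, _ in groupby(best) if label != blank]


# Toy scores over a 3-symbol vocabulary: 0 = blank, 1 and 2 = tokens.
scores = [
    [0.10, 0.80, 0.10],  # best label: 1
    [0.10, 0.70, 0.20],  # best label: 1 (repeat, merged)
    [0.90, 0.05, 0.05],  # best label: blank
    [0.20, 0.70, 0.10],  # best label: 1 (new token, separated by blank)
    [0.10, 0.20, 0.70],  # best label: 2
]
decoded = ctc_greedy_decode(scores)
print(decoded)  # [1, 1, 2]
```

Because every frame is labeled in one pass over the encoder output, the cost of decoding does not grow with a token-by-token generation loop, which is the source of the inference-speed advantage the abstract attributes to non-autoregressive models.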
Related papers
- On The Effects Of Heterogeneous Data Sources On Speech-to-text Foundation Models (2024)
- Exploring The Limits Of Decoder-only Models Trained On Public Speech Recognition Corpora (2024)
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020)
- Bridging The Gaps Of Both Modality And Language: Synchronous Bilingual CTC For Speech Translation And Speech Recognition (2023)
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018)
- Learning From Flawed Data: Weakly Supervised Automatic Speech Recognition (2023)
- BERT Meets CTC: New Formulation Of End-to-end Speech Recognition With Pre-trained Masked Language Model (2022)
- Cotatron: Transcription-guided Speech Encoder For Any-to-many Voice Conversion Without Parallel Data (2020)