Loss Masking Is Not Needed In Decoder-only Transformer For Discrete-token-based ASR
2023 · Qian Chen, Wen Wang, Qinglin Zhang, et al.
Abstract
Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for
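The three training objectives named in the abstract can be contrasted with a small sketch. It is illustrative only and is not the authors' implementation. The abstract says only that SLD applies a KL divergence loss with smoothed labels on speech tokens, so everything else here is an assumption: the smoothing value `eps`, the uniform spread of the smoothing mass, the KL direction, the averaging over counted positions, and all function names.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cross_entropy(logits, target):
    # Conventional cross-entropy against a one-hot label.
    return -log_softmax(logits)[target]

def smoothed_label(target, vocab_size, eps):
    # Assumed smoothing scheme: 1 - eps on the target token,
    # eps shared uniformly among the remaining tokens.
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == target else off for i in range(vocab_size)]

def kl_to_smoothed(logits, target, eps):
    # KL(q || p), with q the smoothed label and p the model distribution.
    logp = log_softmax(logits)
    q = smoothed_label(target, len(logits), eps)
    return sum(qi * (math.log(qi) - lp) for qi, lp in zip(q, logp) if qi > 0.0)

def sequence_loss(step_logits, targets, is_speech, mode, eps=0.1):
    # step_logits[t] are the logits predicting targets[t];
    # is_speech[t] marks whether targets[t] is a speech token.
    total, count = 0.0, 0
    for logits, tgt, speech in zip(step_logits, targets, is_speech):
        if not speech:
            total += cross_entropy(logits, tgt)   # text tokens: plain CE in every mode
        elif mode == "loss_masking":
            continue                              # speech tokens contribute no loss
        elif mode == "cross_entropy":
            total += cross_entropy(logits, tgt)   # speech tokens modeled like text
        elif mode == "smoothed_kl":
            total += kl_to_smoothed(logits, tgt, eps)  # KL against smoothed labels
        else:
            raise ValueError(f"unknown mode: {mode}")
        count += 1
    return total / count if count else 0.0

# Toy sequence: two speech-token targets followed by one text-token target,
# over a shared vocabulary of four tokens.
toy_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.3, 0.2], [0.1, 0.2, 3.0, 0.0]]
toy_targets = [0, 1, 2]
toy_is_speech = [True, True, False]
for mode in ("loss_masking", "cross_entropy", "smoothed_kl"):
    print(mode, sequence_loss(toy_logits, toy_targets, toy_is_speech, mode))
```

With `eps = 0` the smoothed label collapses to a one-hot vector, so under these assumptions the KL term equals the ordinary cross-entropy. In this sketch the only difference between the last two modes is the smoothing.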
Related papers
- Effective Decoder Masking For Transformer Based End-to-end Speech Recognition (2020)
- Streaming Decoder-only Automatic Speech Recognition With Discrete Speech Units: A Pilot Study (2024)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Recycle-and-distill: Universal Compression Strategy For Transformer-based Speech SSL Models With Attention Map Reusing And Masking Distillation (2023)
- Single-stage TTS With Masked Audio Token Modeling And Semantic Knowledge Distillation (2024)
- Semantic Mask For Transformer Based End-to-end Speech Recognition (2019)
- High Fidelity Text-to-speech Via Discrete Tokens Using Token Transducer And Group Masked Language Model (2024)
- Acoustic-aware Non-autoregressive Spell Correction With Mask Sample Decoding (2022)