Abstract

Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but suffer from high computational and storage demands. On the other hand, discrete SSL representations, although with degraded performance, reduce transmission and storage costs, and improve input sequence efficiency through de-duplication and subword-modeling. To boost the performance of discrete representations for ASR, we introduce a novel fusion mechanism that integrates two discrete representations. The fusion mechanism preserves all the benefits of discrete representation while enhancing the model's performance by integrating complementary information. Additionally, we explore "self-augmented'' discrete representations, which apply transformations to a single continuous SSL representation, eliminating the fusion mechanism's dependency on multiple SSL models and further decreasing its inference costs. Experimental results o

Authors

(none)

Tags

  • Speech Recognition
  • Speech Translation

Stats

  • citations1
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score2.26
  • arxiv keywang2024fusion

Related papers