Chunk Based Speech Pre-training With High Resolution Finite Scalar Quantization
2025 · Yun Tang, Cindy Tseng
Abstract
Low-latency spoken human-machine communication is becoming increasingly necessary as speech technology has advanced quickly over the last decade. One of the primary factors behind this advancement is self-supervised learning. Most self-supervised learning algorithms are designed under a full-utterance assumption, and compromises have to be made when partial utterances are presented, which is common in streaming applications. In this work, we propose a chunk-based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with a masked prediction loss, and an acoustic encoder is encouraged to restore the indices of masked speech frames with help from unmasked frames in the same chunk and in preceding chunks. A copy-and-append data augmentation approach is proposed to conduct efficient chunk-based pre-training. Chunk SSL utilizes a finite scalar quantization (FSQ) module to discretize input sp
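The FSQ step named in the abstract can be illustrated with a minimal sketch of generic finite scalar quantization: each feature dimension is squashed into a bounded range, rounded to one of a few integer levels, and the per-dimension levels are combined into a single token index. The function name `fsq_indices`, the level counts `[5, 5, 5]`, and the toy frames are assumptions made for illustration; they are not the configuration used by Chunk SSL, whose "high resolution" codebook is presumably much larger.

```python
import numpy as np

def fsq_indices(z, levels):
    """Map each frame (row of z) to one integer index via generic FSQ.

    Assumes an odd number of levels per dimension, which keeps the
    rounding grid symmetric around zero.
    """
    levels = np.asarray(levels, dtype=np.int64)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch only handles odd level counts")
    half = (levels - 1) // 2
    # Bound each dimension to [-half, half], then round to the nearest integer level.
    codes = np.round(np.tanh(z) * half).astype(np.int64)
    # Shift to digits in {0, ..., L_i - 1} and combine them as a mixed-radix number.
    digits = codes + half
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return digits @ basis

# Two toy 3-dimensional "frames" (hypothetical values).
frames = np.array([[0.0, 0.0, 0.0],
                   [10.0, -10.0, 0.0]])
print(fsq_indices(frames, [5, 5, 5]))
```

With these level counts the codebook has 5 * 5 * 5 = 125 entries, so every frame maps to an index in the range 0 to 124. In a masked prediction setup, indices of this kind would serve as the targets the encoder has to restore for masked frames.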
Related papers
- Self-supervised Learning With Bi-label Masked Speech Prediction For Streaming Multi-talker Speech Recognition (2022)
- StableQuant: Layer Adaptive Post-training Quantization For Speech Foundation Models (2025)
- A Pre-training Framework That Encodes Noise Information For Speech Quality Assessment (2024)
- Multi-resolution HuBERT: Multi-resolution Speech Self-supervised Learning With Masked Unit Prediction (2023)
- Fast-HuBERT: An Efficient Training Framework For Self-supervised Speech Representation Learning (2023)
- Self-supervised Learning With Random-projection Quantizer For Speech Recognition (2022)
- Exploration Of Efficient End-to-end ASR Using Discretized Input From Self-supervised Learning (2023)
- An Adapter Based Pre-training For Efficient And Scalable Self-supervised Speech Representation Learning (2021)