Causal Self-supervised Pretrained Frontend With Predictive Code For Speech Separation
2025 · Wupeng Wang, Zexu Pan, Xinke Li, et al.
Abstract
Speech separation (SS) seeks to disentangle a multi-talker speech mixture into single-talker speech streams. Although SS can be generally achieved using offline methods, such a processing paradigm is not suitable for real-time streaming applications. Causal separation models, which rely only on past and present information, offer a promising solution for real-time streaming. However, these models typically suffer from notable performance degradation due to the absence of future context. In this paper, we introduce a novel frontend that is designed to mitigate the mismatch between training and run-time inference by implicitly incorporating future information into causal models through predictive patterns. The pretrained frontend employs a transformer decoder network with a causal convolutional encoder as the backbone and is pretrained in a self-supervised manner with two innovative pretext tasks: autoregressive hybrid prediction and contextual knowledge distillation. These tasks enable
Related papers
- Contrastive Separative Coding For Self-supervised Representation Learning (2021) · 0.00
- Directed Speech Separation For Automatic Speech Recognition Of Long Form Conversational Speech (2021) · 2.26
- Causal Speech Enhancement With Predicting Semantics Based On Quantized Self-supervised Learning Features (2024) · 3.58
- Dualsep: A Light-weight Dual-encoder Convolutional Recurrent Network For Real-time In-car Speech Separation (2024) · 0.00
- End-to-end Source Separation With Adaptive Front-ends (2017) · 12.17
- A Conformer-based ASR Frontend For Joint Acoustic Echo Cancellation, Speech Enhancement And Speech Separation (2021) · 9.23
- Transmask: A Compact And Fast Speech Separation Model Based On Transformer (2021) · 8.82
- Separate And Reconstruct: Asymmetric Encoder-decoder For Speech Separation (2024) · 0.00