Cocktail Hubert: Generalized Self-supervised Pre-training For Mixture And Single-source Speech
2023 Β· Maryam Fazel-Zarandi, Wei-Ning Hsu
Abstract
Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results with 69% lower WER on multi-speaker ASR, 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
Authors
(none)
Tags
Stats
Related papers
- Selective Hubert: Self-supervised Pre-training For Target Speaker In Clean And Mixture Speech (2023)7.81
- Spatial Hubert: Self-supervised Spatial Speech Representation Learning For A Single Talker From Multi-channel Audio (2023)0.00
- Ms-hubert: Mitigating Pre-training And Inference Mismatch In Masked Language Modelling Methods For Learning Speech Representations (2024)4.52
- Unispeech-sat: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)0.00
- U-hubert: Unified Mixed-modal Speech Pretraining And Zero-shot Transfer To Unlabeled Modality (2022)5.99
- Hubert: Self-supervised Speech Representation Learning By Masked Prediction Of Hidden Units (2021)25.30
- An Adapter Based Multi-label Pre-training For Speech Separation And Enhancement (2022)7.50
- Multi-resolution Hubert: Multi-resolution Speech Self-supervised Learning With Masked Unit Prediction (2023)0.00