Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio
2023 · Antoni Dimitriadis, Siqi Pan, Vidhyasaharan Sethu, et al.
Abstract
Self-supervised learning has been used to leverage unlabelled data, improving accuracy and generalisation of speech systems through the training of representation models. While many recent works have sought to produce effective representations across a variety of acoustic domains, languages, modalities and even simultaneous speakers, these studies have all been limited to single-channel audio recordings. This paper presents Spatial HuBERT, a self-supervised speech representation model that learns both acoustic and spatial information pertaining to a single speaker in a potentially noisy environment by using multi-channel audio inputs. Spatial HuBERT learns representations that outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks, particularly in reverberant and noisy environments. We also demonstrate the utility of the representations learned by Spatial HuBERT on a speech localisation downstream task. Along with this paper, we publi
Related papers
- HuBERT: Self-supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021) 25.30
- Multi-resolution HuBERT: Multi-resolution Speech Self-supervised Learning with Masked Unit Prediction (2023) 0.00
- HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model (2023) 0.00
- Selective HuBERT: Self-supervised Pre-training for Target Speaker in Clean and Mixture Speech (2023) 7.81
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-training (2021) 0.00
- Cocktail HuBERT: Generalized Self-supervised Pre-training for Mixture and Single-source Speech (2023) 6.77
- Learning Audio-visual Speech Representation by Masked Multimodal Cluster Prediction (2022) 5.99
- Fast-HuBERT: An Efficient Training Framework for Self-supervised Speech Representation Learning (2023) 0.00