Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications
2023 · Varun Krishna, Tarun Sai, Sriram Ganapathy
Abstract
Representation learning of speech without textual resources is an area of significant interest for many low-resource speech applications. In this paper, we describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers. The learned "time-frequency" representations from the convolutional neural network (CNN) module are further processed with long short-term memory (LSTM) layers, which generate a contextual vector representation for every windowed segment. The HUC framework, which categorizes the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations. The targets consist of phoneme-like pseudo labels for each audio segment, and these are generated with an iterative k-means algorithm. We explore techniques that impro
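The pseudo-label step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for the contextual vectors produced by the CNN and LSTM layers, the helper names (`sq_dist`, `init_centroids`, `kmeans_pseudo_labels`) are hypothetical, and the farthest-point initialisation is an assumption chosen only to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(frames, k):
    # Farthest-point initialisation: an assumption of this sketch,
    # not something described in the abstract.
    centroids = [list(frames[0])]
    while len(centroids) < k:
        far = max(frames, key=lambda f: min(sq_dist(f, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans_pseudo_labels(frames, k, iters=10):
    """Run plain k-means; the cluster index of each frame is its pseudo label."""
    centroids = init_centroids(frames, k)
    labels = []
    for _ in range(iters):
        # Assignment step: each frame takes the index of its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(f, centroids[c]))
                  for f in frames]
        # Update step: each centroid moves to the mean of its assigned frames.
        for c in range(k):
            members = [f for f, l in zip(frames, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy stand-in for per-segment contextual vectors: two separated groups.
frames = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
          (5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]
labels, centroids = kmeans_pseudo_labels(frames, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In a setup like the one the abstract describes, the resulting cluster indices would serve as training targets for the model, and the clustering would be repeated as the representations change.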
Related papers
- HuBERT: Self-supervised Speech Representation Learning By Masked Prediction Of Hidden Units (2021) 25.30
- Spatial HuBERT: Self-supervised Spatial Speech Representation Learning For A Single Talker From Multi-channel Audio (2023) 0.00
- Learning Audio-visual Speech Representation By Masked Multimodal Cluster Prediction (2022) 5.99
- Unsupervised Lexicon Learning From Speech Is Limited By Representations Rather Than Clustering (2025) 0.00
- SLICER: Learning Universal Audio Representations Using Low-resource Self-supervised Pre-training (2022) 0.00
- Pushing The Limits Of Unsupervised Unit Discovery For SSL Speech Representation (2023) 6.34
- Learning Hidden Unit Contributions For Unsupervised Acoustic Model Adaptation (2016) 14.47
- Deep Self-supervised Hierarchical Clustering For Speaker Diarization (2020) 5.24