A Comparison Of Self-supervised Speech Representations As Input Features For Unsupervised Acoustic Word Embeddings
2020 · Lisa van Staden, Herman Kamper
Abstract
Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWEs) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in the form of automatically discovered word-like segments. Rather than learning embeddings at the segment level, another line of zero-resource research has looked at representation learning at the short-time frame level. Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models. In this paper, we consider whether these frame-level features are beneficial when used as inputs for training an unsupervised AWE model. We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCC
Related papers
- Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints In Encoder-decoder Models (2018)
- Improving Acoustic Word Embeddings Through Correspondence Training Of Self-supervised Speech Representations (2024)
- Layer-wise Analysis Of Self-supervised Acoustic Word Embeddings: A Study On Speech Emotion Recognition (2024)
- Supervised Acoustic Embeddings And Their Transferability Across Languages (2023)
- Analyzing Acoustic Word Embeddings From Pre-trained Self-supervised Speech Models (2022)
- Unsupervised Feature Learning For Speech Using Correspondence And Siamese Networks (2020)
- Leveraging Multilingual Transfer For Unsupervised Semantic Acoustic Word Embeddings (2023)
- Unsupervised Neural And Bayesian Models For Zero-resource Speech Processing (2017)