Analyzing Acoustic Word Embeddings From Pre-trained Self-supervised Speech Models
2022 · Ramon Sanabria, Hao Tang, Sharon Goldwater
Abstract
Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments. In this work, we study several pre-trained models and pooling methods for constructing AWEs with self-supervised representations. Owing to the contextualized nature of self-supervised representations, we hypothesize that simple pooling methods, such as averaging, might already be useful for constructing AWEs. When evaluating on a standard word discrimination task, we find that HuBERT representations with mean-pooling rival the state of the art on English AWEs. More surprisingly, despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual model XLSR-53 (as well as Wav2Vec 2.0 trained on English).
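To make the mean-pooling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the frame-level representations of each word segment have already been extracted from a pre-trained model. Here they are replaced by tiny hand-written toy vectors. The helper names `mean_pool` and `cosine_distance` are hypothetical, and the use of cosine distance to compare embeddings is an assumption for illustration.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Average frame-level vectors over time into one fixed-dimensional vector."""
    if not frames:
        raise ValueError("segment has no frames")
    dim = len(frames[0])
    n = len(frames)
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed here as the comparison metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for frame-level features of three spoken word segments.
# Real features would come from a pre-trained model and have far more dimensions.
seg_a = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]              # 3 frames
seg_b = [[0.9, 0.1], [1.0, 0.0]]                          # 2 frames, similar to seg_a
seg_c = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]  # 4 frames, different

# Segments of different lengths all map to vectors of the same dimension.
awe_a = mean_pool(seg_a)
awe_b = mean_pool(seg_b)
awe_c = mean_pool(seg_c)

dist_similar = cosine_distance(awe_a, awe_b)
dist_different = cosine_distance(awe_a, awe_c)
```

In a word discrimination setting, pairs of segments would be judged the same or different by thresholding a distance such as `dist_similar` or `dist_different`. In this toy data the similar pair ends up closer than the dissimilar pair.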
Related papers
- Supervised Acoustic Embeddings And Their Transferability Across Languages (2023)
- Improving Acoustic Word Embeddings Through Correspondence Training Of Self-supervised Speech Representations (2024)
- Layer-wise Analysis Of Self-supervised Acoustic Word Embeddings: A Study On Speech Emotion Recognition (2024)
- A Comparison Of Self-supervised Speech Representations As Input Features For Unsupervised Acoustic Word Embeddings (2020)
- Leveraging Multilingual Transfer For Unsupervised Semantic Acoustic Word Embeddings (2023)
- An Exploration Of Self-supervised Pretrained Representations For End-to-end Speech Recognition (2021)
- Speech Representation Analysis Based On Inter- And Intra-model Similarities (2024)
- Improved Language Identification Through Cross-lingual Self-supervised Learning (2021)