What Do Self-supervised Speech Models Know About Words?
2023 · Ankita Pasad, Chung-Ming Chien, Shane Settle, et al.
Abstract
Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties -- word identity, boundaries, pronunciation, syntactic features, and semantic features -- encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. W
Related papers
- Layer-wise Analysis Of A Self-supervised Speech Representation Model (2021) · 17.07
- Comparative Layer-wise Analysis Of Self-supervised Speech Models (2022) · 0.00
- What Do Self-supervised Speech And Speaker Models Learn? New Findings From A Cross Model Layer-wise Analysis (2024) · 8.09
- Speech Representation Analysis Based On Inter- And Intra-model Similarities (2024) · 2.26
- A Large-scale Probing Analysis Of Speaker-specific Attributes In Self-supervised Speech Representations (2025) · 0.00
- Similarity Analysis Of Self-supervised Speech Representations (2020) · 10.07
- Don't Speak Too Fast: The Impact Of Data Bias On Self-supervised Speech Models (2021) · 8.35
- Self-supervised Learning For Speech Recognition With Intermediate Layer Supervision (2021) · 9.41