Self-supervised Representation Learning For Speech Using Visual Grounding And Masked Language Modeling
2022 · Puyuan Peng, David Harwath
Abstract
In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, which is a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaST-VGS+, which is learned in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective. On ZeroSpeech 2021, we show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task. On the SUPERB benchmark, we show that our models also achieve strong performance, in some cases even outperforming the popular wav2vec2.0 model.
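As a rough illustration of what training "in a multi-task fashion" can look like, the sketch below adds a masked-prediction loss to a speech-image matching loss. It is a minimal sketch under stated assumptions, not the FaST-VGS+ implementation. The abstract does not specify the loss forms, so a symmetric InfoNCE-style contrastive loss stands in for the visual grounding objective and a cross-entropy over masked frames stands in for the masked language modeling objective. All function names, the `weight` parameter, and the toy inputs are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(sim):
    """Symmetric InfoNCE-style loss (an assumed stand-in for visual grounding).

    sim[i][j] is the similarity between speech clip i and image j;
    matched pairs are assumed to lie on the diagonal.
    """
    n = len(sim)
    # Speech-to-image direction: each row should pick out its own image.
    s2i = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Image-to-speech direction: each column should pick out its own clip.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    i2s = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (s2i + i2s)


def masked_prediction_loss(logits, targets, masked):
    """Cross-entropy averaged over masked frames only (assumed loss form)."""
    idx = [t for t, is_masked in enumerate(masked) if is_masked]
    if not idx:
        return 0.0
    return -sum(log_softmax(logits[t])[targets[t]] for t in idx) / len(idx)


def multitask_loss(sim, logits, targets, masked, weight=1.0):
    """Weighted sum of the two objectives; `weight` is a hypothetical knob."""
    grounding = contrastive_loss(sim)
    masked_lm = masked_prediction_loss(logits, targets, masked)
    return grounding + weight * masked_lm


if __name__ == "__main__":
    # Toy batch: two speech-image pairs, three frames, four target classes.
    sim = [[2.0, 0.1], [0.3, 1.5]]
    logits = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    targets = [0, 1, 3]
    masked = [True, True, False]
    print(multitask_loss(sim, logits, targets, masked, weight=0.5))
```

In this toy setup both losses are computed from the same batch and summed, so a single backward pass would update a shared encoder for both objectives; how the two terms are actually weighted and scheduled is described in the paper itself.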
Related papers
- Fast-slow Transformer For Visually Grounding Speech (2021)
- SUPERB @ SLT 2022: Challenge On Generalization And Efficiency Of Self-supervised Speech Representation Learning (2022)
- Transformer VQ-VAE For Unsupervised Unit Discovery And Speech Synthesis: Zerospeech 2020 Challenge (2020)
- The Zero Resource Speech Benchmark 2021: Metrics And Baselines For Unsupervised Spoken Language Modeling (2020)
- Syllable Discovery And Cross-lingual Generalization In A Visually Grounded, Self-supervised Speech Model (2023)
- Unsupervised Acoustic Unit Representation Learning For Voice Conversion Using Wavenet Auto-encoders (2020)
- Self-supervised Language Learning From Raw Audio: Lessons From The Zero Resource Speech Challenge (2022)
- Improving Unsupervised Subword Modeling Via Disentangled Speech Representation Learning And Transformation (2019)