Speech Sequence Embeddings Using Nearest Neighbors Contrastive Learning
2022 · Robin Algayres, Adel Nabli, Benoit Sagot, et al.
Abstract
We introduce a simple neural encoder architecture that can be trained with an unsupervised contrastive learning objective whose positive samples come from a data-augmented k-Nearest Neighbors search. We show that, when built on top of recent self-supervised audio representations, this method can be applied iteratively and yields competitive speech sequence embeddings (SSE), as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks, our method improves on the state of the art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
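The core idea in the abstract, a contrastive objective whose positives are found by nearest-neighbor search, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity search, the InfoNCE-style loss with in-batch negatives, and the temperature value are all assumptions made for illustration, and the encoder, the data augmentation, and the iterative retraining are left out.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def knn_positive_indices(embeddings, k=1):
    """Return, for each row, the indices of its k nearest neighbors.

    Neighbors are ranked by cosine similarity and a row is never its
    own neighbor. (Hypothetical helper, not taken from the paper.)
    """
    z = l2_normalize(embeddings)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `positives` is the positive for
    row i of `anchors`; all other rows act as in-batch negatives.
    (Assumed loss form and temperature, chosen for illustration.)
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy usage: random vectors stand in for encoded speech sequences.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
nn_idx = knn_positive_indices(emb, k=1)[:, 0]
loss = info_nce_loss(emb, emb[nn_idx])
```

In this sketch two anchors may share the same nearest neighbor, in which case that neighbor also appears among the negatives of the other anchor; a fuller treatment would need to handle such collisions.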
Related papers
- Learning Word Embeddings From Speech (2017) · 0.00
- Supervised Contrastive Learning With Nearest Neighbor Search For Speech Emotion Recognition (2023) · 7.16
- Learning Video Representations Using Contrastive Bidirectional Transformer (2019) · 0.00
- Contrastive Separative Coding For Self-supervised Representation Learning (2021) · 0.00
- Speech2vec: A Sequence-to-sequence Framework For Learning Word Embeddings From Speech (2018) · 14.15
- Acoustic Neighbor Embeddings (2020) · 0.00
- Query-by-example Search With Discriminative Neural Acoustic Word Embeddings (2017) · 12.40
- Speaker Representation Learning Via Contrastive Loss With Maximal Speaker Separability (2022) · 10.68