CSTNet: Contrastive Speech Translation Network for Self-supervised Speech Representation Learning
2020 · Sameer Khurana, Antoine Laurent, James Glass
Abstract
More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting a language proceed by collecting audio data, which trained linguists then annotate manually at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages have no orthographic form, but they usually have speakers who are bilingual and trained in a high-resource language, so it is relatively easy to obtain textual translations corresponding to speech. In this work, we provide a multimodal machine learning framework for speech representation learning that exploits the correlations between the two modalities, namely speech and its corresponding text translation. We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech. The audio encoder is trained to perform a speech-translation retrieval task in a contrastive learnin
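To make the speech-translation retrieval idea concrete, here is a minimal sketch of a contrastive retrieval objective. It is not the authors' implementation: the abstract does not state the exact loss, so an InfoNCE-style softmax over cosine similarities is used purely for illustration. The function names (`cosine`, `retrieval_loss`, `retrieve`), the temperature value, and the toy embeddings are all hypothetical. In practice the embeddings would come from an audio encoder and a text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieval_loss(speech_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's stated objective).

    For utterance i, translation i is the positive example and every
    other translation in the batch is a negative example.
    """
    n = len(speech_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(speech_embs[i], t) / temperature for t in text_embs]
        # Numerically stable log-sum-exp over all candidate translations.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / n

def retrieve(speech_emb, text_embs):
    """Index of the translation most similar to the given utterance."""
    sims = [cosine(speech_emb, t) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy 2-D embeddings, made up for illustration only.
speech = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
text_aligned = [[0.9, 0.0], [0.1, 1.1], [-0.8, 0.1]]
# Same translations, but paired with the wrong utterances.
text_shuffled = [text_aligned[1], text_aligned[2], text_aligned[0]]

aligned_loss = retrieval_loss(speech, text_aligned)
shuffled_loss = retrieval_loss(speech, text_shuffled)
retrieved = [retrieve(s, text_aligned) for s in speech]
```

With correctly paired embeddings the loss is lower than with mismatched pairs, and each utterance retrieves its own translation. This is the behaviour a retrieval-trained encoder is pushed towards.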
Related papers
- Learning Speech Representation From Contrastive Token-acoustic Pretraining (2023) 7.81
- One-to-many Multilingual End-to-end Speech Translation (2019) 9.23
- Leveraging Translations For Speech Transcription In Low-resource Settings (2018) 6.77
- GL-CLeF: A Global-local Contrastive Learning Framework For Cross-lingual Spoken Language Understanding (2022) 10.35
- Contrastive Separative Coding For Self-supervised Representation Learning (2021) 0.00
- Automatic Data Augmentation Selection And Parametrization In Contrastive Self-supervised Speech Representation Learning (2022) 5.24
- HC\(^2\)L: Hybrid And Cooperative Contrastive Learning For Cross-lingual Spoken Language Understanding (2024) 4.52
- Non-contrastive Self-supervised Learning For Utterance-level Information Extraction From Speech (2022) 9.59