Learning Semantic Information From Raw Audio Signal Using Both Contextual And Phonetic Representations
2024 · Jaeyeon Kim, Injune Hwang, Kyogu Lee
Abstract
We propose a framework for learning semantics from raw audio signals using two types of representations, one encoding contextual information and the other phonetic information. Specifically, we introduce a speech-to-unit processing pipeline that captures the two representations at different time resolutions. For the language model, we adopt a dual-channel architecture that incorporates both. We also present two new training objectives, masked context reconstruction and masked context prediction, that push models to learn semantics effectively. Experiments on the sSIMI metric of the Zero Resource Speech Benchmark 2021 and on the Fluent Speech Commands dataset show that our framework learns semantics better than models trained with only one type of representation.
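As an illustration only, the sketch below shows one way two discrete unit streams with different time resolutions could be paired frame by frame before being fed to a dual-channel model. The abstract does not describe the alignment procedure, so the function name `align_streams`, the hop sizes, the unit ids, and the choice to repeat the coarser stream are all assumptions made for this example, not the authors' method.

```python
from typing import List, Tuple


def align_streams(
    fine_units: List[int],
    coarse_units: List[int],
    fine_hop_ms: int,
    coarse_hop_ms: int,
) -> List[Tuple[int, int]]:
    """Pair each fine-resolution unit with the coarse unit covering the same time.

    Hypothetical helper: the hop sizes and the repeat-to-align strategy are
    assumptions for illustration, not details taken from the paper.
    """
    if fine_hop_ms <= 0 or coarse_hop_ms <= 0:
        raise ValueError("hop sizes must be positive")
    if not coarse_units:
        raise ValueError("coarse_units must not be empty")

    pairs: List[Tuple[int, int]] = []
    for i, fine in enumerate(fine_units):
        start_ms = i * fine_hop_ms
        # Index of the coarse frame whose window contains this start time,
        # clamped so trailing fine frames reuse the last coarse unit.
        j = min(start_ms // coarse_hop_ms, len(coarse_units) - 1)
        pairs.append((fine, coarse_units[j]))
    return pairs


# Made-up unit ids: one stream at a 20 ms hop, the other at a 40 ms hop.
fine = [3, 3, 7, 7, 1, 1]
coarse = [12, 40, 5]
aligned = align_streams(fine, coarse, fine_hop_ms=20, coarse_hop_ms=40)
print(aligned)
```

Each resulting pair holds one unit per channel for the same time step, which is the general shape of input a two-channel language model would consume. Which of the paper's two representations is the finer-grained one is not stated in the abstract.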