Contextual Joint Factor Acoustic Embeddings
2019 · Yanpei Shi, Thomas Hain
Abstract
Embedding acoustic information into fixed-length representations is of interest for a wide range of applications in speech and audio technology. Two novel unsupervised approaches to generating acoustic embeddings by modelling acoustic context are proposed. The first approach is a contextual joint factor synthesis encoder, where the encoder in an encoder/decoder framework is trained to extract joint factors from the surrounding audio frames that best generate the target output. The second approach is a contextual joint factor analysis encoder, where the encoder is trained to analyse joint factors from the source signal that correlate best with the neighbouring audio. To evaluate the effectiveness of these approaches compared with prior work, two tasks, phone classification and speaker recognition, are conducted and tested on different TIMIT data sets. Experimental results show that one of the proposed approaches outperforms phone classification baselines, yielding a classification accuracy of …
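The abstract does not specify architectures, context widths or losses, so the following is only a minimal sketch under stated assumptions: a window of k frames on each side of a centre frame, a purely linear encoder/decoder standing in for whatever networks the paper actually uses, and a mean-squared-error objective trained by plain gradient descent. It illustrates the one structural difference the abstract describes: in the synthesis variant the encoder reads the neighbouring frames and the decoder generates the centre frame, while in the analysis variant the encoder reads the centre (source) frame and the decoder predicts its neighbours. The function names `context_pairs` and `train_linear_encoder_decoder`, the window size, the embedding size and the toy random-walk features are all hypothetical.

```python
import numpy as np


def context_pairs(frames, k, mode):
    """Turn a (T, D) feature sequence into (input, target) training pairs.

    Hypothetical helper. mode="synthesis": input = k frames either side of t
    (frame t excluded), target = frame t.
    mode="analysis": input = frame t, target = the 2k neighbouring frames.
    Only centre frames with a full window (k <= t < T - k) are used.
    """
    if mode not in ("synthesis", "analysis"):
        raise ValueError(f"unknown mode: {mode!r}")
    inputs, targets = [], []
    for t in range(k, frames.shape[0] - k):
        # Left and right neighbours, flattened into one vector of length 2k*D.
        context = np.concatenate([frames[t - k:t], frames[t + 1:t + k + 1]]).reshape(-1)
        centre = frames[t]
        if mode == "synthesis":
            inputs.append(context)
            targets.append(centre)
        else:
            inputs.append(centre)
            targets.append(context)
    return np.stack(inputs), np.stack(targets)


def train_linear_encoder_decoder(x, y, emb_dim, steps=200, lr=0.01, seed=0):
    """Fit y ~ (x @ w_enc) @ w_dec by full-batch gradient descent on the MSE.

    Linear stand-in (an assumption, not the paper's model). x @ w_enc is the
    fixed-length embedding. Returns (w_enc, w_dec, losses).
    """
    rng = np.random.default_rng(seed)
    w_enc = 0.1 * rng.standard_normal((x.shape[1], emb_dim))
    w_dec = 0.1 * rng.standard_normal((emb_dim, y.shape[1]))
    losses = []
    for _ in range(steps):
        z = x @ w_enc                      # embeddings, shape (N, emb_dim)
        err = z @ w_dec - y                # prediction error, shape (N, D_out)
        losses.append(float(np.mean(err ** 2)))
        grad_out = 2.0 * err / err.size    # d(mean squared error)/d(prediction)
        grad_dec = z.T @ grad_out
        grad_enc = x.T @ (grad_out @ w_dec.T)  # uses w_dec before its update
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec, losses


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy stand-in for acoustic features: 200 frames x 13 dims of a smooth random walk.
    frames = 0.1 * np.cumsum(rng.standard_normal((200, 13)), axis=0)
    x_syn, y_syn = context_pairs(frames, k=2, mode="synthesis")
    w_enc, _, losses = train_linear_encoder_decoder(x_syn, y_syn, emb_dim=8)
    embeddings = x_syn @ w_enc
    print(x_syn.shape, y_syn.shape, embeddings.shape)  # (196, 52) (196, 13) (196, 8)
    print(f"MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Calling `context_pairs(frames, k=2, mode="analysis")` swaps the roles, so the encoder sees only the centre frame and the decoder must predict its 2k neighbours; in both variants the embedding is the encoder output `x @ w_enc`, and only which side of the window feeds the encoder changes.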
Related papers
- Investigating Design Choices In Joint-embedding Predictive Architectures For General Audio Representation Learning (2024)
- Using Previous Acoustic Context To Improve Text-to-speech Synthesis (2020)
- Content-context Factorized Representations For Automated Speech Recognition (2022)
- An Analysis On The Effects Of Speaker Embedding Choice In Non Auto-regressive TTS (2023)
- A-JEPA: Joint-embedding Predictive Architecture Can Listen (2023)
- Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints In Encoder-decoder Models (2018)
- Acoustic Neighbor Embeddings (2020)
- A Universally-deployable ASR Frontend For Joint Acoustic Echo Cancellation, Speech Enhancement, And Voice Separation (2022)