Unsupervised Cross-modal Audio Representation Learning From Unstructured Multilingual Text
2020 · Alexander Schindler, Sergiu Gordea, Peter Knees
Abstract
We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track relatedness. By applying Latent Semantic Indexing (LSI), we embed the corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI topic modelling facilitates fine-grained selection of similar and dissimilar audio-track pairs to learn the audio representation using a Convolutional Recurrent Neural Network (CRNN). In this way, we directly project the semantic context of the unstructured text modality onto the learned representation space of the audio modality without deriving structured ground-truth annotations from it. We evaluate our approach on the Europeana Sounds collection and show how to improve search in digital audio libraries by harnessing the multilingual metadata provided by numerous European digital libraries.
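The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain bag-of-words LSI (truncated SVD of a term-document count matrix), cosine similarity in the LSI space, and two hypothetical thresholds (`pos_thresh`, `neg_thresh`) for choosing similar and dissimilar tracks. The CRNN audio encoder is omitted; `triplet_loss` only shows the standard margin-based objective that such an encoder's outputs would be trained with. All function names and the toy metadata strings are illustrative assumptions.

```python
import numpy as np


def lsi_embed(docs, k=2):
    """Embed texts with a basic LSI: truncated SVD of a term-document count matrix."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for w in d.lower().split():
            counts[row, index[w]] += 1.0
    # Keep the k strongest latent dimensions (document coordinates U_k * S_k).
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    emb = u[:, :k] * s[:k]
    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)


def select_triplets(text_emb, pos_thresh=0.8, neg_thresh=0.2):
    """Return (anchor, positive, negative) index triplets.

    The thresholds are hypothetical: a track counts as similar to the anchor
    if its text similarity is >= pos_thresh, and dissimilar if <= neg_thresh.
    """
    sim = text_emb @ text_emb.T
    n = sim.shape[0]
    triplets = []
    for a in range(n):
        for p in range(n):
            if p == a or sim[a, p] < pos_thresh:
                continue
            for neg in range(n):
                if neg != a and sim[a, neg] <= neg_thresh:
                    triplets.append((a, p, neg))
    return triplets


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)


# Toy metadata standing in for the text attached to four audio tracks.
docs = [
    "folk song violin dance",
    "folk song violin fiddle",
    "bird call forest recording",
    "bird call forest nature",
]
text_emb = lsi_embed(docs, k=2)
triplets = select_triplets(text_emb)

# In training, these vectors would come from the audio encoder, not be typed by hand.
example_loss = triplet_loss(
    np.array([0.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.0])
)
```

In this toy setting, tracks 0 and 1 share most of their vocabulary, as do tracks 2 and 3, so each anchor is paired with a positive from its own group and a negative from the other. In an online setting, the same selection would be applied within each mini-batch rather than over the whole collection.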
Related papers
- Deep Latent Space Learning For Cross-modal Mapping Of Audio And Visual Signals (2019) 12.17
- Deep Triplet Neural Networks With Cluster-cca For Audio-visual Cross-modal Retrieval (2019) 12.61
- Learning Joint Embedding For Cross-modal Retrieval (2019) 5.84
- Cross-modal Embeddings For Video And Audio Retrieval (2018) 11.08
- Deep Cross-modal Correlation Learning For Audio And Lyrics In Music Retrieval (2017) 14.06
- Contrastive Learning For Cross-modal Artist Retrieval (2023) 0.00
- Matching Text And Audio Embeddings: Exploring Transfer-learning Strategies For Language-based Audio Retrieval (2022) 0.00
- See, Hear, And Read: Deep Aligned Representations (2017) 0.00