Deep Cross-modal Correlation Learning For Audio And Lyrics In Music Retrieval
2017 · Yi Yu, Suhua Tang, Francisco Raposo, et al.
Abstract
Little research has addressed cross-modal correlation learning that takes into account the temporal structures of different data modalities, such as audio and lyrics. Motivated by the inherent temporal structure of music, we aim to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from the different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is used as an objective function to calculate the similarity of temporal structures. This is the first study on understanding the correlation between language and music audio through deep architectures for learning the paired temporal correlation of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent ly
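The canonical correlation objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random arrays stand in for the outputs of the audio and lyrics branches, and the function names, the embedding size, and the regularization constant `reg` are all assumptions made for illustration. The sketch computes the sum of canonical correlations between two sets of paired embeddings, the quantity a CCA-based objective would try to maximize.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def total_canonical_correlation(audio_emb, lyrics_emb, reg=1e-4):
    """Sum of canonical correlations between two paired embedding matrices.

    Both inputs have shape (n_samples, dim). `reg` is a hypothetical ridge
    term that keeps the covariance matrices invertible.
    """
    n = audio_emb.shape[0]
    # Center each view so the products below are covariances.
    a = audio_emb - audio_emb.mean(axis=0)
    b = lyrics_emb - lyrics_emb.mean(axis=0)
    s_aa = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    s_bb = b.T @ b / (n - 1) + reg * np.eye(b.shape[1])
    s_ab = a.T @ b / (n - 1)
    # Whiten both views; the singular values of t are the canonical correlations.
    t = inv_sqrt(s_aa) @ s_ab @ inv_sqrt(s_bb)
    return np.linalg.svd(t, compute_uv=False).sum()


rng = np.random.default_rng(0)
# Stand-in for audio-branch outputs (hypothetical 4-dimensional embeddings).
audio = rng.normal(size=(500, 4))
# A lyrics view that is a well-conditioned linear mix of the audio view.
mixing = 2.0 * np.eye(4) + 0.1 * rng.normal(size=(4, 4))
lyrics_related = audio @ mixing
# A lyrics view that is independent of the audio view.
lyrics_unrelated = rng.normal(size=(500, 4))

corr_related = total_canonical_correlation(audio, lyrics_related)
corr_unrelated = total_canonical_correlation(audio, lyrics_unrelated)
print(corr_related, corr_unrelated)
```

In this toy setting the linearly related pair scores close to the embedding dimension, while the independent pair scores much lower. In a trainable two-branch model, the negative of this quantity would typically serve as the loss.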
Related papers
- HCLAS-X: Hierarchical And Cascaded Lyrics Alignment System Using Multimodal Cross-correlation (2023)
- Exploiting Synchronized Lyrics And Vocal Features For Music Emotion Detection (2019)
- Multi-modal Multi-correlation Learning For Audio-visual Speech Separation (2022)
- MusicTM-Dataset For Joint Representation Learning Among Sheet Music, Lyrics, And Musical Audio (2020)
- Towards Contrastive Learning In Music Video Domain (2023)
- CoLLAP: Contrastive Long-form Language-audio Pretraining With Musical Temporal Structure Augmentation (2024)
- Music Mood Detection Based On Audio And Lyrics With Deep Neural Net (2018)
- Recent Advances And Challenges In Deep Audio-visual Correlation Learning (2022)