Expressivity-aware Music Performance Retrieval Using Mid-level Perceptual Features And Emotion Word Embeddings
2024 · Shreyan Chowdhury, Gerhard Widmer
Abstract
This paper explores a specific sub-task of cross-modal music retrieval. We consider the delicate task of retrieving, from a set of different performances of the same musical piece, the performance or rendition that matches a description of its style, expressive character, or emotion. We observe that a general-purpose cross-modal system trained to learn a common text-audio embedding space does not yield optimal results for this task. By introducing two changes -- one to the text encoder and one to the audio encoder -- we demonstrate improved performance on a dataset of piano performances and associated free-text descriptions. On the text side, we use emotion-enriched word embeddings (EWE); on the audio side, we extract mid-level perceptual features instead of generic audio embeddings. Our results highlight the effectiveness of mid-level perceptual features learnt from music and emotion-enriched word embeddings learnt from emotion-labelled text in capturing musical expression in a cr
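As a rough illustration of the retrieval setting described in the abstract, the sketch below ranks several performances of one piece against a text query by cosine similarity in a shared embedding space. It is a minimal sketch under stated assumptions, not the authors' method: the function names, the three-dimensional toy vectors, and the choice of cosine similarity are all hypothetical, and the encoders that would produce the vectors (emotion-enriched word embeddings for text, mid-level perceptual features for audio) are not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_performances(text_vec, performance_vecs):
    """Return (name, score) pairs sorted from best to worst match for the query."""
    scored = [(name, cosine(text_vec, vec)) for name, vec in performance_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, made-up embeddings: in a real system these would come from the
# audio encoder (one vector per performance) and the text encoder (the query).
performances = {
    "performance_a": [0.9, 0.1, 0.0],
    "performance_b": [0.1, 0.8, 0.3],
    "performance_c": [0.5, 0.5, 0.5],
}
query = [0.2, 0.9, 0.2]  # hypothetical embedding of a free-text description

ranking = rank_performances(query, performances)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, `performance_b` ranks first because its embedding points in nearly the same direction as the query. Any improvement from the two encoder changes named in the abstract would show up as better-placed vectors, not as a different ranking rule.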
Related papers
- Emotion Embedding Spaces For Matching Music To Stories (2021) 0.00
- Contrastive Learning For Cross-modal Artist Retrieval (2023) 0.00
- Towards Robust And Truly Large-scale Audio-sheet Music Retrieval (2023) 4.52
- Towards End-to-end Audio-sheet-music Retrieval (2016) 0.00
- Exploring Modality-agnostic Representations For Music Classification (2021) 0.00
- Audio-visual Embedding For Cross-modal Musicvideo Retrieval Through Supervised Deep CCA (2019) 11.93
- Enriching Music Descriptions With A Finetuned-LLM And Metadata For Text-to-music Retrieval (2024) 7.50
- Wikimute: A Web-sourced Dataset Of Semantic Descriptions For Music Audio (2023) 5.24