S2Cap: A Benchmark And A Baseline For Singing Style Captioning
2024 · Hyunjong Ok, Jaeho Lee
Abstract
Singing voices contain much richer information than common voices, including varied vocal and acoustic properties. However, current open-source audio-text datasets for singing voices capture only a narrow range of attributes and lack acoustic features, which limits their utility for downstream tasks such as style captioning. To fill this gap, we formally define the singing style captioning task and present S2Cap, a dataset of singing voices with detailed descriptions covering diverse vocal, acoustic, and demographic characteristics. Using this dataset, we develop an efficient and straightforward baseline algorithm for singing style captioning. The dataset is available at https://zenodo.org/records/15673764.