SongFormer: Scaling Music Structure Analysis With Heterogeneous Supervision
2025 · Chunbo Hao, Ruibin Yuan, Jixun Yao, et al.
Abstract
Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed toleranc
Related papers
- Supervised Metric Learning For Music Structure Features (2021)
- Convolutive Block-matching Segmentation Algorithm With Application To Music Structure Analysis (2022)
- SSM-Net: Feature Learning For Music Structure Analysis Using A Self-similarity-matrix Based Loss (2022)
- Supervised And Unsupervised Learning Of Audio Representations For Music Understanding (2022)
- SongMASS: Automatic Song Writing With Pre-training And Alignment Constraint (2020)
- Deep Audio-visual Singing Voice Transcription Based On Self-supervised Learning Models (2023)
- Learning Music Audio Representations Via Weak Language Supervision (2021)
- SingMOS-Pro: An Comprehensive Benchmark For Singing Quality Assessment (2025)