Unified Cross-modal Translation Of Score Images, Symbolic Music, And Performance Audio
2025 · Jongmin Jung, Dongmin Kim, Sihun Lee, et al.
Abstract
Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between these modalities are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large-scale dataset and the tokenization of each modality. First, we propose a new dataset that consists of more than 1,300 hours of paired audio-score image data collected from YouTube videos, which is an order of magnitude larger than any existing music modal translation dataset. Second, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a sequence of