Multi-class-token Transformer For Multitask Self-supervised Music Information Retrieval
2025 · Yuexuan Kong, Vincent Lostanlen, Romain Hennequin, et al.
Abstract
Contrastive learning and equivariant learning are effective methods for self-supervised learning (SSL) for audio content analysis. Yet, their application to music information retrieval (MIR) faces a dilemma: the former is more effective on tagging (e.g., instrument recognition) but less effective on structured prediction (e.g., tonality estimation); the latter can match supervised methods on the specific task it is designed for, but it does not generalize well to other tasks. In this article, we adopt a best-of-both-worlds approach by training a deep neural network on both kinds of pretext tasks at once. The proposed architecture is a Vision Transformer with 1-D spectrogram patches (ViT-1D), equipped with two class tokens, which are specialized to different self-supervised pretext tasks but optimized through the same model: hence the qualification of self-supervised multi-class-token multitask (MT2). The former class token optimizes cross-power spectral density (CPSD) for equivaria
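To make the two-class-token idea concrete, here is a minimal NumPy sketch of how such a token sequence could be assembled and passed through one self-attention layer. It is an illustration under stated assumptions, not the authors' implementation: the patch definition (one time frame spanning the full frequency axis), the dimensions, the random weights, and the names `embed_patches` and `self_attention` are all hypothetical, and positional encodings, multiple layers, and both training losses are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def embed_patches(spec, w_embed):
    # spec: (n_freq, n_frames).
    # Assumption: each 1-D patch is a single time frame covering all frequency bins.
    return spec.T @ w_embed  # (n_frames, d_model)


def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over the whole sequence.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


# Hypothetical sizes, chosen only for illustration.
n_freq, n_frames, d_model = 64, 10, 16

spec = rng.standard_normal((n_freq, n_frames))  # stand-in for a spectrogram
w_embed = rng.standard_normal((n_freq, d_model)) / np.sqrt(n_freq)
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

# Two class tokens (learnable parameters in a real model, random here).
class_tokens = rng.standard_normal((2, d_model))

# Prepend both class tokens to the patch tokens: shape (2 + n_frames, d_model).
tokens = np.concatenate([class_tokens, embed_patches(spec, w_embed)], axis=0)
out = self_attention(tokens, w_q, w_k, w_v)

# First token: would feed the CPSD-based equivariant objective (per the abstract).
# Second token: presumably feeds the contrastive objective (the abstract is truncated).
token_a, token_b = out[0], out[1]
```

Both class tokens attend to the same patch tokens through shared weights, which is one way to read "specialized to different self-supervised pretext tasks but optimized through the same model": the specialization would come only from which loss each token's output is routed to.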
Related papers
- Toward Interpretable Music Tagging With Self-attention (2019) · 0.00
- ASiT: Local-global Audio Spectrogram Vision Transformer For Event Classification (2022) · 8.35
- SSAST: Self-supervised Audio Spectrogram Transformer (2021) · 17.61
- Comparing Supervised And Self-supervised Embedding For ExVo Multi-task Learning Track (2022) · 0.00
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024) · 0.00
- AxLSTMs: Learning Self-supervised Audio Representations With xLSTMs (2024) · 2.26
- YourMT3+: Multi-instrument Music Transcription With Enhanced Transformer Architectures And Cross-dataset Stem Augmentation (2024) · 11.84
- MAST: Multiscale Audio Spectrogram Transformers (2022) · 4.52