MuSLCAT: Multi-scale Multi-level Convolutional Attention Transformer For Discriminative Music Modeling On Raw Waveforms
2021 · Kai Middlebrook, Shyam Sudhakaran, David Guy Brizan
Abstract
In this work, we aim to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-scale and Multi-level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features from multiple scales and levels by integrating a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions by using two convolutional attention networks with attention-augmented convolution (AAC) blocks. The backend dynamically recalibrates multi-scale and level features extracted from the frontend by incorporati
Related papers
- Sample-level CNN Architectures For Music Auto-tagging Using Raw Waveforms (2017)
- Sample-level Deep Convolutional Neural Networks For Music Auto-tagging Using Raw Waveforms (2017)
- Multi-level And Multi-scale Feature Aggregation Using Pre-trained Convolutional Neural Networks For Music Auto-tagging (2017)
- Automatic Tagging Using Deep Convolutional Neural Networks (2016)
- Toward Interpretable Music Tagging With Self-attention (2019)
- Multi-class-token Transformer For Multitask Self-supervised Music Information Retrieval (2025)
- ReconVAT: A Semi-supervised Automatic Music Transcription Framework For Low-resource Real-world Data (2021)
- YourMT3+: Multi-instrument Music Transcription With Enhanced Transformer Architectures And Cross-dataset Stem Augmentation (2024)