AxLSTMs: Learning Self-supervised Audio Representations With xLSTMs
2024 · Sarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan
Abstract
While the transformer has become the preeminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have attracted renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM. However, although xLSTMs have shown performance competitive with the transformer, their viability for learning self-supervised general-purpose audio representations has not been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach for learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 20% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.
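As a rough illustration of what "masked spectrogram patches" can mean in practice, the NumPy sketch below splits a spectrogram into non-overlapping patches, hides a random subset, and scores a reconstruction only on the hidden patches. The abstract gives no such details, so the patch size (16 x 16), the 50% mask ratio, the zero-fill corruption, the mean-squared-error objective, and all function names here are assumptions made for illustration, not the authors' configuration. The encoder is replaced by an identity placeholder.

```python
import numpy as np


def patchify(spec, patch_f, patch_t):
    """Split a (freq, time) spectrogram into flattened non-overlapping patches."""
    n_f, n_t = spec.shape
    if n_f % patch_f or n_t % patch_t:
        raise ValueError("spectrogram shape must be divisible by the patch shape")
    grid_f, grid_t = n_f // patch_f, n_t // patch_t
    # (grid_f, patch_f, grid_t, patch_t) -> (grid_f, grid_t, patch_f, patch_t)
    blocks = spec.reshape(grid_f, patch_f, grid_t, patch_t).transpose(0, 2, 1, 3)
    return blocks.reshape(grid_f * grid_t, patch_f * patch_t)


def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True marking the patches hidden from the model."""
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 192))      # stand-in for a log-mel spectrogram
patches = patchify(spec, 16, 16)           # 5 x 12 grid of 256-dim patches
mask = random_mask(len(patches), 0.5, rng)

corrupted = patches.copy()
corrupted[mask] = 0.0                      # assumed corruption: zero-fill masked patches

pred = corrupted                           # placeholder for an encoder's reconstruction
loss = masked_mse(pred, patches, mask)
```

In a real pretraining setup, `pred` would come from a sequence model run over the patch sequence (an xLSTM stack in AxLSTM, a transformer in SSAST), and the loss would be backpropagated through it.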
Related papers
- Audio Mamba: Selective State Spaces For Self-supervised Audio Representations (2024)
- SSAST: Self-supervised Audio Spectrogram Transformer (2021)
- ASiT: Local-global Audio Spectrogram Vision Transformer For Event Classification (2022)
- Streaming Transformer-based Acoustic Models Using Self-attention With Augmented Memory (2020)
- Multi-class-token Transformer For Multitask Self-supervised Music Information Retrieval (2025)
- A Low Latency Attention Module For Streaming Self-supervised Speech Representation Learning (2023)
- Audio Mamba: Bidirectional State Space Model For Audio Representation Learning (2024)
- XLST: Cross-lingual Self-training To Learn Multilingual Representation For Low Resource Speech Recognition (2021)