ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification
2022 · Sara Atito, Muhammad Awais, Wenwu Wang, et al.
Abstract
Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labeled data, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the huge gap between the domains of natural images and audio. This has motivated research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learn
Related papers
- SSAST: Self-Supervised Audio Spectrogram Transformer (2021) 17.61
- ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions (2024) 3.58
- MAST: Multiscale Audio Spectrogram Transformers (2022) 4.52
- Multi-Class-Token Transformer for Multitask Self-Supervised Music Information Retrieval (2025) 0.00
- AxLSTMs: Learning Self-Supervised Audio Representations with xLSTMs (2024) 2.26
- Efficient Large-Scale Audio Tagging via Transformer-to-CNN Knowledge Distillation (2022) 17.68
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification (2024) 0.00
- Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers (2023) 6.34