ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification

Abstract

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labeled data, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learn
