Abstract

Transformers, originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities because of their flexibility in learning long-range relationships. Because transformers are data-hungry and labeled data is limited, most transformer-based models for audio tasks are fine-tuned from ImageNet-pretrained models, despite the large gap between the domains of natural images and audio. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learn

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations: 12
  • S2 citations: -
  • github stars: 0
  • HF likes: 0
  • heat score: 8.35
  • arxiv key: atito2022asit

Related papers