Improved Zero-shot Audio Tagging & Classification With Patchout Spectrogram Transformers
2022 · Paul Primus, Gerhard Widmer
Abstract
Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the very recent patchout spectrogram transformer with two classic convolutional architectures. We evaluate these three architectures on three tasks and on three different benchmark datasets: general-purpose tagging on AudioSet, environmental sound classification on ESC-50, and instrument tagging on OpenMIC. Our results show that the self-attention-based embedding methods outperform both compared convolutional architectures in all of these settings. By designing training and test data accordingly, we observe that prediction performance suffers significantly when the 'semantic distance' between trai
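To illustrate the general idea of predicting unseen classes from class descriptions, here is a minimal sketch. It is not the authors' implementation: it assumes that an audio clip and each class description have already been turned into embedding vectors, that a linear map (here called `projection`) carries audio embeddings into the description space, and that classes are scored by cosine similarity. The function names and toy vectors are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_predict(audio_emb, class_embs, projection):
    """Score every class description against one audio embedding.

    audio_emb:  (d_audio,)          embedding of the audio clip (assumed given)
    class_embs: (n_classes, d_text) embeddings of the class descriptions
    projection: (d_audio, d_text)   hypothetical learned linear map
    Returns the index of the best-scoring class and all cosine scores.
    """
    projected = l2_normalize(audio_emb @ projection)  # audio in description space
    classes = l2_normalize(class_embs)                # unit-length class vectors
    scores = classes @ projected                      # cosine similarity per class
    return int(np.argmax(scores)), scores


# Toy data: three classes that were never used to fit `projection`.
audio_emb = np.array([1.0, 0.0, 0.0])
projection = np.eye(3)[:, :2]  # keeps the first two audio dimensions
class_embs = np.array([[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])

pred, scores = zero_shot_predict(audio_emb, class_embs, projection)
print(pred, scores)
```

Because the class set enters only through `class_embs`, new classes can be added at test time by supplying their description embeddings, without retraining the audio embedding model.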
Related papers
- Zero-shot Audio Classification Using Image Embeddings (2022) 6.34
- Efficient Training Of Audio Transformers With Patchout (2021) 22.11
- SSAST: Self-supervised Audio Spectrogram Transformer (2021) 17.61
- Speech Enhancement With Zero-shot Model Selection (2020) 7.81
- AVGZSLNet: Audio-visual Generalized Zero-shot Learning By Reconstructing Label Features From Multi-modal Embeddings (2020) 12.10
- An Empirical Study Of Weakly Supervised Audio Tagging Embeddings For General Audio Representations (2022) 0.00
- Attention And Localization Based On A Deep Convolutional Recurrent Model For Weakly Supervised Audio Tagging (2017) 11.39
- Efficient Large-scale Audio Tagging Via Transformer-to-CNN Knowledge Distillation (2022) 17.68