Streaming Audio Transformers For Online Audio Tagging
2023 · Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, et al.
Abstract
Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely used AudioSet dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-XL-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in mean average precision (mAP) at delays of 2 s and 1 s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available at https://github.com/RicherMans/SAT.
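The Transformer-XL-like chunk processing mentioned in the abstract can be illustrated with a minimal sketch. This is not the SAT implementation: it assumes a single attention head with no learned projections, and the names `attention`, `stream_chunks`, `chunk_len`, and `mem_len` are hypothetical. It only shows the general idea of processing a long frame sequence chunk by chunk while each chunk also attends to a short cache of earlier frames.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for a single head (no learned weights)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def stream_chunks(frames, chunk_len, mem_len):
    """Process `frames` of shape (T, D) chunk by chunk.

    Each chunk attends to itself plus up to `mem_len` cached frames from
    earlier chunks (an assumption made for illustration only).
    """
    dim = frames.shape[1]
    memory = np.zeros((0, dim))  # empty cache before the first chunk
    outputs = []
    for start in range(0, len(frames), chunk_len):
        chunk = frames[start:start + chunk_len]
        context = np.concatenate([memory, chunk], axis=0)
        # Queries come from the current chunk only; keys and values also
        # include the cached frames.
        outputs.append(attention(chunk, context, context))
        # Keep only the most recent `mem_len` frames for the next chunk.
        memory = context[-mem_len:] if mem_len > 0 else np.zeros((0, dim))
    return np.concatenate(outputs, axis=0)

frames = np.random.default_rng(0).normal(size=(10, 4))
out = stream_chunks(frames, chunk_len=4, mem_len=2)
print(out.shape)
```

In this sketch the attention matrix for each step has at most `chunk_len` rows and `chunk_len + mem_len` columns, so the per-step cost does not grow with the total length of the audio.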
Related papers
- Efficient Large-scale Audio Tagging Via Transformer-to-CNN Knowledge Distillation (2022)
- Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset (2020)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- SSAST: Self-supervised Audio Spectrogram Transformer (2021)
- Taming Data And Transformers For Audio Generation (2024)
- Streaming Transformer-based Acoustic Models Using Self-attention With Augmented Memory (2020)
- Study Of Lightweight Transformer Architectures For Single-channel Speech Enhancement (2025)
- ASiT: Local-global Audio Spectrogram Vision Transformer For Event Classification (2022)