Composing General Audio Representation By Fusing Multilayer Features Of A Pre-trained Model
2022 · Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, et al.
Abstract
Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, and they extract features from the last layers. In this study, we focus on our finding that the middle layer features of existing supervised pre-trained models are more effective than the late layer features for some tasks. We propose a simple approach to compose features effective for general-purpose applications, consisting of two steps: (1) calculating feature vectors along the time frame from middle/late layer outputs, and (2) fusing them. This approach improves the utility of frequency and channel information in downstream processes, and combines the effectiveness of middle and late layer features for different tasks. As a result, the feature vectors become effective for general purposes. In the experiments using VGGish, PANNs' CNN14, and AST on nine downstream tasks, we first show that each layer output of these models serves different tasks. Then, we demonstrat
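The two steps named in the abstract (per-frame feature vectors from middle and late layer outputs, then fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer output shape (channels, frequency, time), the flattening of channel and frequency into one vector per frame, the average pooling used to align time resolutions, and concatenation as the fusion step are all assumptions made for illustration. The function names `frame_features`, `align_time`, and `fuse_layers` are hypothetical.

```python
import numpy as np


def frame_features(layer_out: np.ndarray) -> np.ndarray:
    """Turn a (channels, frequency, time) layer output into per-frame vectors.

    Assumption: channel and frequency axes are flattened together so each
    time frame keeps both kinds of information. Returns (time, channels * frequency).
    """
    c, f, t = layer_out.shape
    return layer_out.reshape(c * f, t).T


def align_time(feats: np.ndarray, target_frames: int) -> np.ndarray:
    """Average-pool (time, dim) features down to target_frames frames.

    Assumption: the source frame count is an integer multiple of the target.
    """
    t, d = feats.shape
    if t % target_frames != 0:
        raise ValueError("frame count must be a multiple of target_frames")
    factor = t // target_frames
    return feats.reshape(target_frames, factor, d).mean(axis=1)


def fuse_layers(middle: np.ndarray, late: np.ndarray) -> np.ndarray:
    """Fuse middle and late layer outputs by concatenation (an assumption)."""
    mid = frame_features(middle)
    lat = frame_features(late)
    mid = align_time(mid, lat.shape[0])  # match the late layer's time resolution
    return np.concatenate([mid, lat], axis=1)


# Hypothetical shapes: the middle layer has finer time/frequency resolution.
rng = np.random.default_rng(0)
middle = rng.standard_normal((8, 16, 24))   # (channels, frequency, time)
late = rng.standard_normal((32, 4, 6))
fused = fuse_layers(middle, late)
print(fused.shape)  # (6, 256)
```

Each row of `fused` is one time frame; a downstream task would typically pool these rows over time before classification, which is left out here.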
Related papers
- AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021)
- Automated Audio Captioning Via Fusion Of Low- And High- Dimensional Features (2022)
- Multimodal Fusion With Deep Neural Networks For Audio-video Emotion Recognition (2019)
- Learning Robust Heterogeneous Signal Features From Parallel Neural Network For Audio Sentiment Analysis (2018)
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021)
- Audio Source Separation Via Multi-scale Learning With Dilated Dense U-nets (2019)
- Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021)
- Utilizing Domain Knowledge In End-to-end Audio Processing (2017)