A Light-weight Multimodal Framework For Improved Environmental Audio Tagging
2017 · Juncheng Li, Yun Wang, Joseph Szurley, et al.
Abstract
The lack of strong labels has severely limited the ability of state-of-the-art fully supervised audio tagging systems to scale to larger datasets. Meanwhile, audio-visual learning models based on unlabeled videos have been successfully applied to audio tagging, but they are inevitably resource-hungry and require a long time to train. In this work, we propose a light-weight, multimodal framework for environmental audio tagging. The audio branch of the framework is a convolutional and recurrent neural network (CRNN) based on multiple instance learning (MIL). It is trained with the audio tracks of a large collection of weakly labeled YouTube video excerpts. The video branch uses pretrained state-of-the-art image recognition networks and word embeddings to extract information from the video track and to map visual objects to sound events. Experiments on the audio tagging task of the DCASE 2017 challenge show that the incorporation of video information improves a strong baseline audio tagging syste