Audio-visual Scene Classification: Analysis Of DCASE 2021 Challenge Submissions
2021 · Shanshan Wang, Toni Heittola, Annamaria Mesaros, et al.
Abstract
This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task concerns scene classification using both audio and video modalities, based on a dataset of synchronized recordings. It attracted 43 submissions from 13 teams around the world, and more than half of the submitted systems outperformed the baseline. Techniques common among the top systems include large pretrained models such as ResNet or EfficientNet, which are then trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation are also employed to boost performance. More importantly, all of the top 5 teams employed multi-modal methods that use both audio and video. The best system achieved a logloss of 0.195 and an accuracy of 93.8%, compared with a logloss of 0.662 and an accuracy of 77.1% for the baseline system.
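The two figures reported above, logloss and accuracy, can be illustrated with a minimal sketch. It assumes the standard multiclass cross-entropy definition of logloss (mean negative natural log of the probability assigned to the correct class) and top-1 accuracy; the function names, the clipping constant, and the toy probabilities are hypothetical and are not taken from the challenge's evaluation code.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the correct class.

    y_true: list of integer class indices, one per clip.
    probs:  list of per-class probability lists, one per clip.
    eps:    clipping constant (an assumption) to avoid log(0).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


def accuracy(y_true, probs):
    """Fraction of clips whose highest-probability class is the true class."""
    hits = 0
    for label, row in zip(y_true, probs):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            hits += 1
    return hits / len(y_true)


# Toy example: three clips, three hypothetical scene classes.
y_true = [0, 1, 2]
probs = [
    [0.8, 0.1, 0.1],  # correct and confident
    [0.2, 0.7, 0.1],  # correct
    [0.3, 0.4, 0.3],  # wrong: class 1 predicted, class 2 is true
]

loss = multiclass_log_loss(y_true, probs)
acc = accuracy(y_true, probs)
print(f"logloss={loss:.3f} accuracy={acc:.3f}")
```

Unlike accuracy, logloss depends on the probability given to the correct class, so a system that is confidently wrong on a few clips is penalized more heavily than one that is uncertain. The two metrics can therefore rank systems differently.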