A Study On Joint Modeling And Data Augmentation Of Multi-modalities For Audio-visual Scene Classification
2022 · Qing Wang, Jun Du, Siyuan Zheng, et al.
Abstract
In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC). We employ pre-trained networks trained only on image data sets to extract video embeddings, whereas the audio embedding models are trained from scratch. We explore different neural network architectures for joint modeling to effectively combine the video and audio modalities. Moreover, data augmentation strategies are investigated to increase the size of the audio-visual training set. For the video modality, the effectiveness of several operations in RandAugment is verified. An audio-video joint mixup scheme is proposed to further improve AVSC performance. Evaluated on the development set of TAU Urban Audio Visual Scenes 2021, our final system achieves the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
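The abstract names an audio-video joint mixup scheme without giving its details. The following is a minimal sketch of one plausible reading, assuming standard mixup with a single Beta-distributed coefficient and a single sample pairing shared by both modalities and the labels. The function name `joint_mixup`, the `alpha` value, and the embedding sizes are illustrative assumptions, not the authors' settings.

```python
import numpy as np


def joint_mixup(audio, video, labels, alpha=0.2, rng=None):
    """Mix a batch of paired audio/video features with one shared coefficient.

    Assumption: standard mixup applied jointly, so the same lambda and the
    same partner sample are used for the audio features, the video features,
    and the one-hot labels of each example.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # shared mixing coefficient in [0, 1]
    perm = rng.permutation(audio.shape[0])    # shared partner index per example
    mixed_audio = lam * audio + (1.0 - lam) * audio[perm]
    mixed_video = lam * video + (1.0 - lam) * video[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_audio, mixed_video, mixed_labels, lam


# Usage with made-up shapes: 4 clips, 8-dim audio and 16-dim video embeddings,
# one-hot labels over 10 scene classes.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
video = rng.normal(size=(4, 16))
labels = np.eye(10)[[0, 1, 2, 3]]
mixed_audio, mixed_video, mixed_labels, lam = joint_mixup(audio, video, labels, rng=rng)
```

Sharing the coefficient and the pairing keeps each mixed audio clip aligned with its mixed video clip and with the soft label, which is the property that distinguishes a joint scheme from mixing each modality independently.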
Related papers
- Audio-visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy (2022)
- Audio-visual Scene Classification: Analysis Of DCASE 2021 Challenge Submissions (2021)
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017)
- Audio Visual Segmentation Through Text Embeddings (2025)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)
- An Empirical Study Of Visual Features For DNN Based Audio-visual Speech Enhancement In Multi-talker Environments (2020)
- Exploring Train And Test-time Augmentations For Audio-language Learning (2022)
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022)