Audio-visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy
2022 · Chengxin Chen, Meng Wang, Pengyuan Zhang
Abstract
Recently, audio-visual scene classification (AVSC) has attracted increasing attention from multidisciplinary communities. Previous studies tended to adopt a pipeline training strategy, which first uses well-trained visual and acoustic encoders to extract high-level representations (embeddings), then uses them to train the audio-visual classifier. The embeddings extracted this way are well suited to uni-modal classifiers, but not necessarily to multi-modal ones. In this paper, we propose a joint training framework that takes the acoustic features and raw images directly as inputs for the AVSC task. Specifically, we retrieve the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and the 1D-CNN-based acoustic encoder during training. We evaluate the approach on the development dataset of TAU Urban Audio-Visual Scenes 2021. The experimental results show that our proposed approach achieves a significant improvement over the conventi
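The joint training idea described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the layer sizes, the class names `AcousticEncoder1D` and `JointAVSC`, the choice to freeze the visual encoder, the 10-class output, and the tiny stand-in visual encoder (used in place of the bottom layers of a real pre-trained image model) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class AcousticEncoder1D(nn.Module):
    """Hypothetical 1D-CNN over acoustic features shaped (batch, n_mels, frames)."""

    def __init__(self, n_mels=64, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, emb_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # average over time
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # (batch, emb_dim)


class JointAVSC(nn.Module):
    """Frozen visual encoder + trainable acoustic encoder and scene classifier."""

    def __init__(self, visual_encoder, visual_dim, n_mels=64, emb_dim=128, n_classes=10):
        super().__init__()
        self.visual_encoder = visual_encoder
        # Assumption: the visual encoder is kept fixed during joint training.
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        self.acoustic_encoder = AcousticEncoder1D(n_mels, emb_dim)
        self.classifier = nn.Sequential(
            nn.Linear(visual_dim + emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, audio, image):
        with torch.no_grad():
            v = self.visual_encoder(image)  # (batch, visual_dim)
        a = self.acoustic_encoder(audio)    # (batch, emb_dim)
        return self.classifier(torch.cat([a, v], dim=1))


def train_step(model, optimizer, audio, image, labels):
    """One joint update of the acoustic encoder and the classifier."""
    model.train()
    model.visual_encoder.eval()  # keep the frozen encoder in inference mode
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(audio, image), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
# Stand-in for the bottom layers of a pre-trained image model (assumption).
visual = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model = JointAVSC(visual, visual_dim=16)
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# Random tensors in place of real acoustic features, images, and scene labels.
audio = torch.randn(4, 64, 100)
image = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))
loss = train_step(model, optimizer, audio, image, labels)
```

The point of the sketch is that the classification loss back-propagates through the classifier into the acoustic encoder, so the acoustic embedding is shaped by the multi-modal objective instead of being fixed in advance as in a pipeline strategy.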
Related papers
- A Study On Joint Modeling And Data Augmentation Of Multi-modalities For Audio-visual Scene Classification (2022)
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022)
- Audio-visual Scene Classification: Analysis Of DCASE 2021 Challenge Submissions (2021)
- Audio Visual Segmentation Through Text Embeddings (2025)
- Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings (2022)
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024)