Comparative Analysis Of Modality Fusion Approaches For Audio-visual Person Identification And Verification
2024 · Aref Farhadipour, Masoumeh Chapariniya, Teodora Vukovic, et al.
Abstract
Multimodal learning integrates information from several modalities to improve learning and comprehension. We compare three modality fusion strategies for person identification and verification using two modalities: voice and face. A one-dimensional convolutional neural network extracts x-vectors from voice, while the pre-trained VGGFace2 network with transfer learning handles the face modality. In addition, the gammatonegram is used as a speech representation in conjunction with the pre-trained Darknet19 network. The proposed systems are evaluated with K-fold cross-validation on the 118 speakers of the VoxCeleb2 test set. Single-modality systems and the three proposed multimodal strategies are compared under identical conditions. Results demonstrate that the feature fusion strategy of gammatonegram and facial features achieves the highest performance, with an accuracy of 98.37% in the person identifica
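The abstract names feature fusion and K-fold cross-validation but does not say how either is implemented. As a minimal sketch only, the code below assumes that feature fusion means L2-normalizing each modality's embedding and concatenating the results, and that the folds are contiguous, unshuffled index blocks. The function names (`l2_normalize`, `fuse_features`, `k_fold_indices`) and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_features(voice_emb: Sequence[float], face_emb: Sequence[float]) -> List[float]:
    """Feature-level fusion (assumed form): normalize each modality, then concatenate."""
    return l2_normalize(voice_emb) + l2_normalize(face_emb)


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists for k contiguous folds without shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    voice = [3.0, 4.0]        # hypothetical 2-D voice embedding
    face = [0.0, 5.0, 12.0]   # hypothetical 3-D face embedding
    print(fuse_features(voice, face))
    for train_idx, test_idx in k_fold_indices(10, 5):
        print(train_idx, test_idx)
```

Normalizing before concatenation keeps one modality from dominating the fused vector purely because its embedding has a larger scale. A classifier trained on the fused vectors would then be scored on each held-out fold in turn.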
Related papers
- Cross-modal Speaker Verification And Recognition: A Multilingual Perspective (2020) 0.00
- Automatic Quality Assessment For Audio-visual Verification Systems. The Love Submission To NIST SRE Challenge 2019 (2020) 0.00
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) 9.23
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022) 18.00
- Attention-based Cross-modal Fusion For Audio-visual Voice Activity Detection In Musical Video Streams (2021) 5.24
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024) 0.00
- Multimodal Fusion With Deep Neural Networks For Audio-video Emotion Recognition (2019) 0.00
- Detecting Expressions With Multimodal Transformers (2020) 10.74