Temporarily-aware Context Modelling Using Generative Adversarial Networks For Speech Activity Detection
2020 · Tharindu Fernando, Sridha Sridharan, Mitchell McLaren, et al.
Abstract
This paper presents a novel joint learning framework for Speech Activity Detection (SAD), inspired by the recent success of multi-task learning approaches in the speech processing domain. We utilise generative adversarial networks to automatically learn a loss function for the joint prediction of frame-wise speech/non-speech classifications together with the next audio segment. To exploit the temporal relationships within the input signal, we propose a temporal discriminator which aims to ensure that the predicted signal is temporally consistent. We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT'17, AMI Meeting and HAVIC, where it outperforms state-of-the-art SAD approaches. Furthermore, our cross-database evaluations demonstrate the robustness of the proposed approach across different languages, accents, and acoustic environments.
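The abstract does not state the training objective, so the following is only a minimal sketch of the standard GAN losses that a learned-loss setup of this kind typically builds on. It assumes a discriminator that outputs a probability in (0, 1) for each scored sequence; the function names, the toy scores, and the choice of the non-saturating generator loss are illustrative assumptions, not the authors' formulation.

```python
import math

EPS = 1e-12  # numerical guard against log(0)


def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss (binary cross-entropy form).

    d_real: discriminator probabilities for ground-truth sequences.
    d_fake: discriminator probabilities for generator predictions.
    The loss is low when real sequences score near 1 and generated ones near 0.
    """
    real_term = sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Non-saturating generator loss: low when generated sequences score near 1."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)


# Hypothetical discriminator scores for a batch of three sequences.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.1, 0.3]

loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

In this sketch the generator's loss falls as its predictions become harder to distinguish from real sequences, which is the sense in which the adversarial signal stands in for a hand-designed loss.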
Related papers
- Speech Activity Detection Based On Multilingual Speech Recognition System (2020) 5.24
- End-to-end Audiovisual Speech Activity Detection With Bimodal Recurrent Neural Models (2018) 10.48
- Speech Enhancement Aided End-to-end Multi-task Learning For Voice Activity Detection (2020) 11.49
- SADDEL: Joint Speech Separation And Denoising Model Based On Multitask Learning (2020) 0.00
- Adversarial Multi-task Deep Learning For Noise-robust Voice Activity Detection With Low Algorithmic Delay (2022) 2.26
- Video-driven Speech Reconstruction Using Generative Adversarial Networks (2019) 11.39
- Is Someone Speaking? Exploring Long-term Temporal Features For Audio-visual Active Speaker Detection (2021) 21.12
- Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework (2017) 10.21