Multi-task Learning Of Deep Neural Networks For Audio Visual Automatic Speech Recognition
2017 · Abhinav Thanda, Shankar M Venkatesan
Abstract
Multi-task learning (MTL) involves the simultaneous training of two or more related tasks over shared representations. In this work, we apply MTL to audio-visual automatic speech recognition (AV-ASR). Our primary task is to learn a mapping between audio-visual fused features and frame labels obtained from an acoustic GMM/HMM model. This is combined with an auxiliary task that maps visual features to frame labels obtained from a separate visual GMM/HMM model. The MTL model is tested at various levels of babble noise, and the results are compared with a baseline hybrid DNN-HMM AV-ASR model. Our results indicate that MTL is especially useful at higher levels of noise. Compared to the baseline, up to 7% relative improvement in WER is reported at -3 dB SNR.
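The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to combine a primary and an auxiliary frame-classification task: a weighted sum of two cross-entropy losses. The function names, the `aux_weight` hyperparameter, and all numeric values are assumptions for illustration and are not taken from the paper. A small helper also shows how a relative WER improvement is conventionally computed.

```python
import math


def cross_entropy(posteriors, label):
    """Negative log-probability the model assigns to the correct frame label."""
    return -math.log(posteriors[label])


def mtl_loss(primary_post, primary_label, aux_post, aux_label, aux_weight=0.5):
    """Weighted sum of the primary and auxiliary per-frame losses.

    primary_post: posteriors from the audio-visual head (labels from the acoustic GMM/HMM).
    aux_post:     posteriors from the visual head (labels from the visual GMM/HMM).
    aux_weight:   hypothetical hyperparameter, not a value reported in the paper.
    """
    return (cross_entropy(primary_post, primary_label)
            + aux_weight * cross_entropy(aux_post, aux_label))


def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative posteriors and labels only (not data from the paper).
frame_loss = mtl_loss([0.7, 0.2, 0.1], 0, [0.5, 0.3, 0.2], 1, aux_weight=0.5)
```

With `aux_weight=0` the objective reduces to ordinary single-task training, so the weight controls how strongly the visual task shapes whatever representation the two tasks share.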
Related papers
- Incorporating VAD Into ASR System By Multi-task Learning (2021) 4.52
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024) 10.07
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024) 8.60
- Audio Visual Speech Recognition Using Deep Recurrent Neural Networks (2016) 7.81
- Acquiring Pronunciation Knowledge From Transcribed Speech Audio Via Multi-task Learning (2024) 0.00
- Using Multi-task Learning To Improve The Performance Of Acoustic-to-word And Conventional Hybrid Models (2019) 0.00
- Tandem Multitask Training Of Speaker Diarisation And Speech Recognition For Meeting Transcription (2022) 7.81