Joint Speaker Features Learning For Audio-visual Multichannel Speech Separation And Recognition
2024 · Guinan Li, Jiajun Deng, Youjun Chen, et al.
Abstract
This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments on multichannel overlapped speech simulated from LRS3-TED data suggest that joint speaker feature learning consistently improves speech separation and recognition performance over baselines without joint speaker feature estimation. Further analyses reveal that the performance improvements are strongly correlated with increased inter-speaker discrimination, measured using cosine similarity. After incorporating WavLM features and the video modality, the best-performing system adapted with joint speaker feature learning outperformed the baseline fine-tuned WavLM model by statistically significant WER reductions of 21.6% and 25.3% absolute (67.5% and 83.5% relative) on the Dev and Test sets, respectively.
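The abstract measures inter-speaker discrimination with cosine similarity. As a minimal sketch of what such a measurement could look like, the snippet below averages the pairwise cosine similarity over embeddings of distinct speakers, where a lower average indicates better-separated speakers. The function names and the toy vectors are hypothetical and used only for illustration; the paper's actual embeddings come from xVector and ECAPA-TDNN encoders, and its exact analysis procedure is not described here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mean_inter_speaker_similarity(embeddings):
    """Average cosine similarity over all pairs of distinct speakers.

    `embeddings` holds one vector per speaker (hypothetical layout).
    Lower values mean the speakers are easier to tell apart.
    """
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalise rows
    sims = E @ E.T                                    # all pairwise cosines
    upper = np.triu_indices(len(E), k=1)              # each pair once, no diagonal
    return float(sims[upper].mean())


# Toy (made-up) speaker embeddings: two nearly parallel, one orthogonal to the first.
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(mean_inter_speaker_similarity(toy))
```

Under this reading, "increased inter-speaker discrimination" would correspond to a drop in the average similarity between embeddings of different speakers.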
Related papers
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) · 0.00
- Audio-visual Multi-channel Speech Separation, Dereverberation And Recognition (2022) · 6.77
- Multi-channel Speaker Verification For Single And Multi-talker Speech (2020) · 0.00
- Efficient Integration Of Multi-channel Information For Speaker-independent Speech Separation (2020) · 0.00
- TS-SEP: Joint Diarization And Separation Conditioned On Estimated Speaker Embeddings (2023) · 10.35
- Y-vector: Multiscale Waveform Encoder For Speaker Embedding (2020) · 8.60
- Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification (2019) · 0.00
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025) · 0.00