Self-supervised Contrastive Cross-modality Representation Learning For Spoken Question Answering
2021 · Chenyu You, Nuo Chen, Yuexian Zou
Abstract
Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks (utterance restoration, utterance insertion, and question discrimination) and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations with a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. In addition, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. By this means, the training schemes can more effectively guid
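The abstract names span deletion and span substitution as augmentations for a contrastive objective but does not spell out either procedure or the loss. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`span_deletion`, `span_substitution`, `info_nce`), the random choice of span position, the substitution vocabulary, and the InfoNCE-style loss with cosine similarity and a temperature are all illustrative choices that the abstract does not confirm.

```python
import math
import random


def span_deletion(tokens, span_len, rng):
    """Drop one contiguous span of `span_len` tokens at a random position."""
    tokens = list(tokens)
    if span_len <= 0 or span_len >= len(tokens):
        return tokens  # nothing sensible to delete
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + tokens[start + span_len:]


def span_substitution(tokens, span_len, vocab, rng):
    """Replace one contiguous span with tokens sampled from `vocab`."""
    tokens = list(tokens)
    if span_len <= 0 or span_len > len(tokens):
        return tokens  # span does not fit
    start = rng.randrange(len(tokens) - span_len + 1)
    replacement = [rng.choice(vocab) for _ in range(span_len)]
    return tokens[:start] + replacement + tokens[start + span_len:]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed here): negative log-softmax of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]
```

In this sketch, an utterance and its augmented copy would be encoded into vectors and passed to `info_nce` as the anchor and positive, with other utterances in the batch serving as negatives. The loss is small when the augmented view stays close to the original, which is one way to read "noise-invariant" in the abstract.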
Related papers
- End-to-end Contrastive Language-speech Pretraining Model For Long-form Spoken Question Answering (2025) · 0.00
- SpeechBERT: An Audio-and-text Jointly Learned Language Model For End-to-end Spoken Question Answering (2019) · 12.33
- Mitigating The Impact Of Speech Recognition Errors On Spoken Question Answering By Adversarial Domain Adaptation (2019) · 6.77
- Contrastive Learning For Improving ASR Robustness In Spoken Language Understanding (2022) · 6.34
- Automatic Data Augmentation Selection And Parametrization In Contrastive Self-supervised Speech Representation Learning (2022) · 5.24
- QS-TTS: Towards Semi-supervised Text-to-speech Synthesis Via Vector-quantized Self-supervised Speech Representation Learning (2023) · 2.26
- Joint Training Of Speech Enhancement And Self-supervised Model For Noise-robust ASR (2022) · 0.00
- Sviqa: A Unified Speech-vision Multimodal Model For Textless Visual Question Answering (2025) · 0.00