End-to-end Contrastive Language-speech Pretraining Model For Long-form Spoken Question Answering
2025 · Jiliang Hu, Zuchao Li, Baoyuan Qi, et al.
Abstract
Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle to process long audio. Following the success of retrieval-augmented generation, a speech-related retriever shows promise for preprocessing long-form speech, but the performance of existing speech-related retrievers remains inadequate. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for downstream SQA tasks. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby bridging the gap between modalities more effectively. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches
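To make the retrieval idea in the abstract concrete, here is a minimal sketch of generic contrastive retrieval, not CLSR itself. It assumes that questions and audio segments have already been embedded into the same vector space, ranks segments by cosine similarity to the question, and computes a standard InfoNCE loss over matched question-segment pairs. The function names, the toy vectors, and the temperature of 0.07 are all illustrative assumptions, and the sketch does not model the intermediate step that converts acoustic features into text-like representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_segments(question_vec, segment_vecs, k=2):
    """Return indices of the k segments most similar to the question."""
    scored = [(cosine(question_vec, s), i) for i, s in enumerate(segment_vecs)]
    # Sort by descending similarity; break ties by original segment order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


def info_nce(question_vecs, segment_vecs, temperature=0.07):
    """Mean question-to-segment InfoNCE loss; question i matches segment i."""
    total = 0.0
    for i, q in enumerate(question_vecs):
        logits = [cosine(q, s) / temperature for s in segment_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(question_vecs)


# Toy embeddings (made up for illustration only).
question = [1.0, 0.0, 0.2]
segments = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.3],
    [0.5, 0.5, 0.0],
    [-1.0, 0.0, 0.0],
]
print(top_k_segments(question, segments, k=2))  # prints [1, 2]
```

In a long-form SQA setting, the selected segments would then be passed to a downstream answering model in place of the full recording; how CLSR segments the audio and trains its encoders is not specified in this abstract.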
Related papers
- Self-supervised Contrastive Cross-modality Representation Learning For Spoken Question Answering (2021)
- SpeechBERT: An Audio-and-text Jointly Learned Language Model For End-to-end Spoken Question Answering (2019)
- CLASP: Contrastive Language-speech Pretraining For Multilingual Multimodal Information Retrieval (2024)
- Spoken Question Answering And Speech Continuation Using Spectrogram-powered LLM (2023)
- SpeechDPR: End-to-end Spoken Passage Retrieval For Open-domain Spoken Question Answering (2024)
- Retrieval Augmented Generation In Prompt-based Text-to-speech Synthesis With Context-aware Contrastive Language-audio Pretraining (2024)
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021)
- Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models (2025)