Investigating The Effects Of Large-scale Pseudo-stereo Data And Different Speech Foundation Model On Dialogue Generative Spoken Language Model
2024 · Yu-Kuan Fu, Cheng-Kuang Lee, Hsiu-Hsuan Wang, et al.
Abstract
Recent efforts in Spoken Dialogue Modeling aim to synthesize spoken dialogue without the need for direct transcription, thereby preserving the wealth of non-textual information inherent in speech. However, this approach faces a challenge when speakers talk simultaneously, requiring stereo dialogue data with speakers recorded on separate channels, a notably scarce resource. To address this, we have developed an innovative pipeline capable of transforming single-channel dialogue data into pseudo-stereo data. This expanded our training dataset from a mere 2,000 to an impressive 17,600 hours, significantly enriching the diversity and quality of the training examples available. The inclusion of this pseudo-stereo data has proven to be effective in improving the performance of spoken dialogue language models. Additionally, we explored the use of discrete units of different speech foundation models for spoken dialogue generation.
Related papers
- Generating Data With Text-to-speech And Large-language Models For Conversational Speech Recognition (2024) 6.34
- Paralinguistics-enhanced Large Language Modeling Of Spoken Dialogue (2023) 0.00
- Resource-efficient Adaptation Of Speech Foundation Models For Multi-speaker ASR (2024) 3.58
- Exploring Speech Foundation Models For Speaker Diarization In Child-adult Dyadic Interactions (2024) 5.24
- DialogueAgents: A Hybrid Agent-based Speech Synthesis Framework For Multi-party Dialogue (2025) 1.69
- Scale This, Not That: Investigating Key Dataset Attributes For Efficient Speech Enhancement Scaling (2024) 0.00
- Property-aware Multi-speaker Data Simulation: A Probabilistic Modelling Technique For Synthetic Data Generation (2023) 6.34
- SD-Eval: A Benchmark Dataset For Spoken Dialogue Understanding Beyond Words (2024) 11.32