Speechcaps: Advancing Instruction-based Universal Speech Models With Multi-talker Speaking Style Captioning
2024 · Chien-Yu Huang, Min-Han Shih, Ke-Han Lu, et al.
Abstract
Instruction-based speech processing is becoming popular. Studies show that training with multiple tasks boosts performance, but collecting diverse, large-scale tasks and datasets is expensive. Thus, it is highly desirable to design a fundamental task that benefits other downstream tasks. This paper introduces a multi-talker speaking style captioning task to enhance the understanding of speaker and prosodic information. We used large language models to generate descriptions for multi-talker speech. We then pre-trained our model on this captioning task and followed with instruction tuning. Evaluation on Dynamic-SUPERB shows that our model outperforms the baseline pre-trained only on single-talker tasks, particularly in speaker and emotion recognition. Additionally, tests on a multi-talker QA task reveal that current models struggle with attributes such as gender, pitch, and speaking rate. The code and dataset are available at https://github.com/cyhuang-tw/speechcaps.
Related papers
- Stylecap: Automatic Speaking-style Captioning From Speech Based On Speech And Language Self-supervised Learning Models (2023)
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)
- Emotioncaps: Enhancing Audio Captioning Through Emotion-augmented Data Generation (2024)
- Sound-vecaps: Improving Audio Generation With Visual Enhanced Captions (2024)
- S2cap: A Benchmark And A Baseline For Singing Style Captioning (2024)
- Desta: Enhancing Speech Language Models Through Descriptive Speech-text Alignment (2024)
- Boosting Multi-speaker Expressive Speech Synthesis With Semi-supervised Contrastive Learning (2023)
- Human Listening And Live Captioning: Multi-task Training For Speech Enhancement (2021)