SpeechCraft: A Fine-grained Expressive Speech Dataset With Natural Language Description
2024 · Zeyu Jin, Jia Jia, Qixin Wang, et al.
Abstract
Speech-language multi-modal learning presents a significant challenge due to the fine-grained, nuanced information inherent in speech styles. Therefore, a large-scale dataset providing elaborate comprehension of speech style is urgently needed to facilitate insightful interplay between speech audio and natural language. However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. First, speech clips are processed by a series of expert classifiers and captioning models to capture diverse speech characteristics; a fine-tuned LLaMA then generates customized annotations. Unlike previous tag/template-based annotation frameworks with limited information and diversity, our system provides in-depth understandings of speec
Related papers
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) · 6.34
- Emospeech: A Corpus Of Emotionally Rich And Contextually Detailed Speech Annotations (2024) · 0.00
- Natural Language Guidance Of High-fidelity Text-to-speech With Synthetic Annotations (2024) · 0.00
- StoryTTS: A Highly Expressive Text-to-speech Dataset With Rich Textual Expressiveness Annotations (2024) · 3.58
- InstructTTS: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023) · 0.00
- Expressive TTS Driven By Natural Language Prompts Using Few Human Annotations (2023) · 0.00
- AudioSetMix: Enhancing Audio-language Datasets With LLM-assisted Augmentations (2024) · 0.00
- SpeechLLM-as-Judges: Towards General And Interpretable Speech Quality Evaluation (2025) · 2.60