StyleCap: Automatic Speaking-style Captioning From Speech Based On Speech And Language Self-supervised Learning Models
2023 · Kazuki Yamauchi, Yusuke Ijima, Yuki Saito
Abstract
We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech. Most conventional techniques for para-/non-linguistic information recognition focus on category classification or intensity estimation of pre-defined labels, and they cannot explain the reasoning behind a recognition result in an interpretable manner. StyleCap is a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning. StyleCap is trained with paired data of speech and natural language descriptions. We train neural networks that convert a speech representation vector into prefix vectors that are fed into a large language model (LLM)-based text decoder. We explore an appropriate text decoder and speech feature representation suitable for this new task. The experimental results demonstrate that our StyleCap leveraging richer LLMs for the text decoder, speech self-supervised learning (SS
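The prefix mechanism described in the abstract (a speech representation vector converted into prefix vectors that are fed to the text decoder) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration only: the abstract says just "neural networks", so the single linear layer with a tanh, the function names `make_prefix` and `build_decoder_input`, the random weights, and all dimensions are hypothetical stand-ins, not the authors' architecture.

```python
import numpy as np


def make_prefix(speech_vec, weight, bias, prefix_len):
    """Map one speech representation vector to `prefix_len` prefix vectors.

    Assumption: a single linear layer followed by tanh stands in for the
    mapping network; the source does not specify its form.
    """
    flat = np.tanh(speech_vec @ weight + bias)  # shape: (prefix_len * embed_dim,)
    return flat.reshape(prefix_len, -1)         # shape: (prefix_len, embed_dim)


def build_decoder_input(prefix, token_embeddings):
    """Prepend the prefix vectors to the caption token embeddings."""
    return np.concatenate([prefix, token_embeddings], axis=0)


# Hypothetical sizes chosen only to keep the example small.
speech_dim, embed_dim, prefix_len, num_tokens = 16, 8, 4, 5

rng = np.random.default_rng(0)
weight = rng.normal(scale=0.1, size=(speech_dim, prefix_len * embed_dim))
bias = np.zeros(prefix_len * embed_dim)

speech_vec = rng.normal(size=speech_dim)                   # stand-in for a speech feature vector
caption_embeddings = rng.normal(size=(num_tokens, embed_dim))  # stand-in for embedded caption tokens

prefix = make_prefix(speech_vec, weight, bias, prefix_len)
decoder_input = build_decoder_input(prefix, caption_embeddings)
print(decoder_input.shape)
```

In a trained system the decoder would attend to the prefix positions while generating the caption, so the speech information reaches the language model purely through its input embedding space. The sketch only shows the shapes involved, not training or decoding.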
Related papers
- SpeechCaps: Advancing Instruction-based Universal Speech Models With Multi-talker Speaking Style Captioning (2024)
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- Expressive Text-to-speech Using Style Tag (2021)
- Style-Talker: Finetuning Audio Language Model And Style-based Text-to-speech Model For Fast Spoken Dialogue Generation (2024)
- StyleBook: Content-dependent Speaking Style Modeling For Any-to-any Voice Conversion Using Only Speech Data (2023)
- PromptStyle: Controllable Style Transfer For Text-to-speech With Natural Language Descriptions (2023)
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024)