Generative Spoken Language Model Based On Continuous Word-sized Audio Tokens
2023 · Robin Algayres, Yossi Adi, Tu Anh Nguyen, et al.
Abstract
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs is 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and
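The abstract names k-NN sampling as the replacement for multinomial sampling and attributes the fivefold memory saving to 200ms units. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the helper names `knn_sample` and `units_needed`, the Euclidean distance, the uniform choice among the k neighbours, and the fixed unit length are all assumptions made for illustration.

```python
import math
import random


def knn_sample(predicted, lexicon, k=3, rng=None):
    """Return the index of one of the k lexicon embeddings nearest to `predicted`.

    Hypothetical helper: Euclidean distance and a uniform draw among the
    k nearest neighbours are assumptions, not the paper's stated rule.
    """
    if rng is None:
        rng = random.Random()
    # Rank lexicon indices by distance to the predicted embedding.
    ranked = sorted(range(len(lexicon)),
                    key=lambda i: math.dist(predicted, lexicon[i]))
    # Sample among the k closest entries instead of from a softmax.
    return rng.choice(ranked[:k])


def units_needed(duration_ms, unit_ms):
    """Number of fixed-length units covering an utterance, rounded up.

    Assumes every unit has exactly the same duration (a simplification).
    """
    return math.ceil(duration_ms / unit_ms)


# Toy lexicon of 2-D embeddings; real embeddings would be high-dimensional.
lexicon = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
idx = knn_sample((0.1, 0.1), lexicon, k=3, rng=random.Random(0))
# idx is 0, 1, or 2: the distant entry 3 is outside the 3 nearest neighbours.

# A 10-second utterance: 250 units at 40ms versus 50 units at 200ms.
ratio = units_needed(10_000, 40) / units_needed(10_000, 200)  # 5.0
```

Under these assumptions the sequence fed to the language model is five times shorter with 200ms units than with 40ms units, which matches the ratio quoted in the abstract.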
Related papers
- AudioLM: A Language Modeling Approach To Audio Generation (2022)
- Text-free Prosody-aware Generative Spoken Language Modeling (2021)
- Towards Language Modelling In The Speech Domain Using Sub-word Linguistic Units (2021)
- Long-form Speech Generation With Spoken Language Models (2024)
- SELM: Speech Enhancement Using Discrete Tokens And Language Models (2023)
- Multimodal Latent Language Modeling With Next-token Diffusion (2024)
- DiscreteSLU: A Large Language Model With Self-supervised Discrete Speech Units For Spoken Language Understanding (2024)
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024)