Exploring Compressibility Of Transformer Based Text-to-music (TTM) Models
2024 · Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, et al.
Abstract
State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop- or server-class compute, making them infeasible to deploy on mobile phones. This paper analyzes the trade-offs between model compression and generation performance in TTM models. We study compression through knowledge distillation, together with specific modifications that make it applicable to each component of the TTM model (the encoder, the generative model, and the decoder). Leveraging these methods, we create TinyTTM (89.2M params), which achieves a FAD of 3.66 and a KL of 1.32 on the MusicBench dataset. These scores are better than those of MusicGen-Small (557.6M params), but not lower than those of MusicGen-Small fine-tuned on MusicBench.
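The abstract does not spell out the distillation objective, so the following is only a minimal sketch of the generic soft-target formulation of knowledge distillation, not the TinyTTM training loss. It assumes a student and a teacher that each emit logits over the same token vocabulary; the function names and the temperature value are illustrative choices, not taken from the paper.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum_exp for x in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the conventional rescaling that keeps gradient
    magnitudes comparable across temperatures (an assumption here,
    not something stated in the abstract).
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2

# Hypothetical logits for one token position.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's softened distribution exactly and grows as the two diverge. In practice it is usually combined with the ordinary task loss on ground-truth targets.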
Related papers
- Quality-aware Masked Diffusion Transformer For Enhanced Music Generation (2024)
- Diffusion Based Text-to-music Generation With Global And Local Text Based Conditioning (2025)
- EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-to-speech Models (2022)
- Empirical Evaluation Of Deep Learning Model Compression Techniques On The WaveNet Vocoder (2020)
- Is Smaller Always Faster? Tradeoffs In Compressing Self-supervised Speech Transformers (2022)
- Text-to-audio Generation Using Instruction-tuned LLM And Latent Diffusion Model (2023)
- FakeMusicCaps: A Dataset For Detection And Attribution Of Synthetic Music Generated Via Text-to-music Models (2024)
- AudioComposer: Towards Fine-grained Audio Generation With Natural Language Descriptions (2024)