Transformation Of Audio Embeddings Into Interpretable, Concept-based Representations
2025 · Alice Zhang, Edison Thomaz, Lie Lu
Abstract
Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio representations. In this work, we explore the semantic interpretability of audio embeddings extracted from these neural networks by leveraging CLAP, a contrastive learning model that brings audio and text into a shared embedding space. We implement a post-hoc method to transform CLAP embeddings into concept-based, sparse representations with semantic interpretability. Qualitative and quantitative evaluations show that the concept-based representations outperform or match the performance of original audio embeddings on downstream tasks while providing interpretability. Additionally, we demonstrate that fine-tuning the concept-based representations can further improve their performance on downstream tasks. Lastly, we publish three audio-specific vocabulari
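The abstract does not spell out the transformation itself, so the following is only a minimal sketch of one generic way to turn a dense embedding into a sparse, concept-based representation: a non-negative, L1-penalized decomposition over a vocabulary of concept embeddings. It is not claimed to be the authors' method. Random unit vectors stand in for CLAP text embeddings of the vocabulary, and the function name, penalty value, and dimensions are all hypothetical.

```python
import numpy as np


def sparse_concept_weights(embedding, concept_matrix, l1_penalty=0.05, n_iter=500):
    """Non-negative sparse coding via projected ISTA (illustrative only).

    Approximately solves
        min_w 0.5 * ||concept_matrix.T @ w - embedding||^2 + l1_penalty * sum(w)
        subject to w >= 0
    concept_matrix has shape (n_concepts, dim), one concept embedding per row.
    """
    C = concept_matrix
    # Step size 1/L, where L is the squared spectral norm of C.
    step = 1.0 / (np.linalg.norm(C, 2) ** 2)
    w = np.zeros(C.shape[0])
    for _ in range(n_iter):
        grad = C @ (C.T @ w - embedding)
        # Gradient step, L1 shrinkage, and projection onto w >= 0 in one line.
        w = np.maximum(w - step * (grad + l1_penalty), 0.0)
    return w


# Toy setup: random placeholders, NOT real CLAP embeddings.
rng = np.random.default_rng(0)
n_concepts, dim = 40, 128
concepts = rng.normal(size=(n_concepts, dim))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# Synthetic "audio embedding" built from two known concepts (indices 3 and 17).
audio = 0.7 * concepts[3] + 0.3 * concepts[17]
audio /= np.linalg.norm(audio)

weights = sparse_concept_weights(audio, concepts)
top_concepts = np.argsort(weights)[::-1][:2]
print("top concept indices:", top_concepts)
print("non-zero weights:", int(np.count_nonzero(weights > 1e-6)))
```

In a real pipeline the rows of `concepts` would be text embeddings of vocabulary terms and `audio` an audio embedding from the same shared space; each non-zero weight then reads as "how much of this concept is present", which is what makes the representation inspectable.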
Related papers
- Diverse Audio Embeddings -- Bringing Features Back Outperforms CLAP! (2023)
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024)
- EnCLAP: Combining Neural Audio Codec And Audio-text Joint Embedding For Automated Audio Captioning (2024)
- Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining (2025)
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022)
- On The Language Encoder Of Contrastive Cross-modal Models (2023)
- Audio Concept Classification With Hierarchical Deep Neural Networks (2017)
- Do Audio-language Models Understand Linguistic Variations? (2024)