Language-based Audio Retrieval With Converging Tied Layers And Contrastive Loss
2022 · Andrew Koh, Eng Siong Chng
Abstract
In this paper, we tackle the new Language-Based Audio Retrieval task proposed in DCASE 2022. First, we introduce a simple, scalable architecture that ties the audio and text encoders together. Second, we show that combining this architecture with a contrastive loss allows the model to significantly outperform the baseline model. Finally, in addition to having an extremely low training memory requirement, the approach lets us use pretrained models as is, without fine-tuning them. Our experiments show that the combination of these methods beats the baseline scores significantly.
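The abstract does not spell out the layer layout or the exact loss, so the following is only a minimal sketch of the general idea. It assumes frozen pretrained encoders that already produced fixed-size audio and text features, one modality-specific input layer per branch, a single shared ("tied") layer that both branches converge into, and a symmetric InfoNCE-style contrastive loss. All names, dimensions, and the temperature value are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def tied_embed(feats, w_in, w_shared):
    # Modality-specific input layer with ReLU, then the shared (tied) layer.
    hidden = np.maximum(feats @ w_in, 0.0)
    return l2_normalize(hidden @ w_shared)

def contrastive_loss(audio_z, text_z, temperature=0.07):
    # Row i of audio_z and row i of text_z form the matching pair.
    logits = audio_z @ text_z.T / temperature
    matched = np.diag(logits)
    loss_audio_to_text = (logsumexp(logits, axis=1) - matched).mean()
    loss_text_to_audio = (logsumexp(logits, axis=0) - matched).mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Hypothetical sizes: random arrays stand in for frozen pretrained features.
rng = np.random.default_rng(0)
batch, d_audio, d_text, d_hidden, d_joint = 4, 2048, 768, 256, 128
audio_feats = rng.normal(size=(batch, d_audio))
text_feats = rng.normal(size=(batch, d_text))

# Only these small matrices would be trained; the encoders stay frozen.
w_audio = rng.normal(scale=d_audio ** -0.5, size=(d_audio, d_hidden))
w_text = rng.normal(scale=d_text ** -0.5, size=(d_text, d_hidden))
w_shared = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_joint))

audio_z = tied_embed(audio_feats, w_audio, w_shared)
text_z = tied_embed(text_feats, w_text, w_shared)
loss = contrastive_loss(audio_z, text_z)
print(f"contrastive loss on random features: {loss:.4f}")
```

Because the pretrained encoders are treated as fixed feature extractors in this sketch, only the small projection matrices would need gradients, which is one plausible reading of the low training memory requirement mentioned above.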
Related papers
- Improving Natural-language-based Audio Retrieval With Transfer Learning And Audio & Text Augmentations (2022) 0.00
- Large-scale Contrastive Language-audio Pretraining With Feature Fusion And Keyword-to-caption Augmentation (2022) 19.60
- Learning Disentangled Speech Representations With Contrastive Learning And Time-invariant Retrieval (2024) 5.84
- Automated Audio Captioning And Language-based Audio Retrieval (2022) 0.00
- Contrastive Latent Space Reconstruction Learning For Audio-text Retrieval (2023) 3.58
- Performance Improvement Of Language-queried Audio Source Separation Based On Caption Augmentation From Large Language Models For DCASE Challenge 2024 Task 9 (2024) 0.00
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023) 0.00
- CTAL: Pre-training Cross-modal Transformer For Audio-and-language Representations (2021) 7.50