Improving Query-by-vocal Imitation With Contrastive Learning And Audio Pretraining
2024 · Jonathan Greif, Florian Schmid, Paul Primus, et al.
Abstract
Query-by-Vocal Imitation (QBV) is the task of searching a database of audio files using vocal imitations produced by the user. Since most humans can effectively communicate sound concepts through voice, QBV offers a more intuitive and convenient approach than text-based search. To fully leverage QBV, developing robust audio feature representations for both the vocal imitation and the original sound is crucial. In this paper, we present a new system for QBV that utilizes the feature extraction capabilities of Convolutional Neural Networks pre-trained on large-scale general-purpose audio datasets. We integrate these pre-trained models into a dual encoder architecture and fine-tune them end-to-end using contrastive learning. A distinctive aspect of our proposed method is the fine-tuning of the pre-trained models with an adapted NT-Xent loss for contrastive learning, which creates a shared embedding space for reference recordings and vocal imitations. The proposed system si
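The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors adapted NT-Xent, so the code below shows only the generic cross-modal form: each vocal imitation is paired with the reference recording at the same index, and all other recordings in the batch act as negatives. The function names, the temperature value `tau=0.1`, and the symmetric averaging over both directions are assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def nt_xent_cross_modal(imitations, references, tau=0.1):
    """Generic cross-modal NT-Xent loss (hypothetical helper, not the paper's variant).

    imitations[i] and references[i] are embeddings of a matching pair;
    every other index in the batch is treated as a negative.
    tau is an assumed temperature, not a value from the paper.
    """
    q = [l2_normalize(v) for v in imitations]
    r = [l2_normalize(v) for v in references]
    n = len(q)

    # Temperature-scaled cosine similarity between every imitation and every reference.
    sim = [[sum(a * b for a, b in zip(q[i], r[j])) / tau for j in range(n)]
           for i in range(n)]

    total = 0.0
    for i in range(n):
        row = sim[i]                         # imitation i against all references
        col = [sim[j][i] for j in range(n)]  # reference i against all imitations
        total += logsumexp(row) - row[i]
        total += logsumexp(col) - col[i]
    return total / (2 * n)


# Toy batch: two orthogonal embeddings per modality.
aligned = nt_xent_cross_modal([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent_cross_modal([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each imitation points in the same direction as its own reference, and large when the pairs are swapped. That is the pressure that pulls matching imitations and recordings together in a shared embedding space. In a real system the embeddings would come from the two fine-tuned encoders rather than hand-written vectors.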
Related papers
- Large-scale Contrastive Language-audio Pretraining With Feature Fusion And Keyword-to-caption Augmentation (2022) 19.60
- AVQVC: One-shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022) 12.40
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) 6.34
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023) 0.00
- Learning Acoustic Word Embeddings With Temporal Context For Query-by-example Speech Search (2018) 9.92
- Semantic Query-by-example Speech Search Using Visual Grounding (2019) 7.81
- VQVC+: One-shot Voice Conversion By Vector Quantization And U-net Architecture (2020) 13.34
- Exploring Efficient-tuned Learning Audio Representation Method From Brivl (2023) 0.00