Direct Multimodal Few-shot Learning Of Speech And Images
2020 · Leanne Nortje, Herman Kamper
Abstract
We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples. Imagine an agent is shown an image along with a spoken word describing the object in the picture, e.g. pen, book and eraser. After observing a few paired examples of each class, the model is asked to identify the "book" in a set of unseen pictures. Previous work used a two-step indirect approach relying on learned unimodal representations: speech-speech and image-image comparisons are performed across the support set of given speech-image pairs. We propose two direct models which instead learn a single multimodal space where inputs from different modalities are directly comparable: a multimodal triplet network (MTriplet) and a multimodal correspondence autoencoder (MCAE). To train these direct models, we mine speech-image pairs: the support set is used to pair up unlabelled in-domain speech and images. In a speech-to-image digit matching task, dir
Related papers
- Unsupervised Vs. Transfer Learning For Multimodal One-shot Matching Of Speech And Images (2020)
- Multimodal One-shot Learning Of Speech And Images (2018)
- Cross-modal Denoising: A Novel Training Paradigm For Enhancing Speech-image Retrieval (2024)
- Transcription-enriched Joint Embeddings For Spoken Descriptions Of Images And Videos (2020)
- Learning Modality-invariant Representations For Speech And Images (2017)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Metric Learning With Progressive Self-distillation For Audio-visual Embedding Learning (2025)
- Fine-grained Grounding For Multimodal Speech Recognition (2020)