Talk, Don't Write: A Study Of Direct Speech-based Image Retrieval
2021 · Ramon Sanabria, Austin Waters, Jason Baldridge
Abstract
Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
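The abstract reports results as recall-at-one: the share of spoken queries whose correct image is ranked first. A minimal sketch of that metric follows, assuming each query has exactly one matching image (query i matches image i) and that query-image similarity scores are already computed. The function name `recall_at_k` and the toy scores are hypothetical and for illustration only; this is not the authors' evaluation code.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching image ranks in the top k.

    similarity[i][j] is the score between spoken query i and image j.
    Assumption: the correct image for query i is image i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices by descending score (ties keep index order).
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 spoken queries against 3 images (made-up numbers).
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

In this toy case the second query ranks the wrong image first, so it counts as a miss at k=1 and as a hit at k=2.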
Related papers
- Recqr: Incorporating Conversational Query Rewriting To Improve Multimodal Image Retrieval (2026) 0.00
- You'll Never Walk Alone: A Sketch And Text Duet For Fine-grained Image Retrieval (2024) 9.41
- Speaker Retrieval In The Wild: Challenges, Effectiveness And Robustness (2025) 2.26
- Adapting Dual-encoder Vision-language Models For Paraphrased Retrieval (2024) 0.00
- Enhancing Image Retrieval : A Comprehensive Study On Photo Search Using The CLIP Mode (2024) 0.00
- Chatsearch: A Dataset And A Generative Retrieval Model For General Conversational Image Retrieval (2024) 2.00
- Speech-image Semantic Alignment Does Not Depend On Any Prior Classification Tasks (2020) 3.58
- Telling The What While Pointing To The Where: Multimodal Queries For Image Retrieval (2021) 10.07