Speech-image Semantic Alignment Does Not Depend On Any Prior Classification Tasks
2020 · Masood S. Mortazavi
Abstract
Semantically-aligned \((speech, image)\) datasets can be used to explore "visually-grounded speech". In most existing investigations, image features are extracted using neural networks "pre-trained" on other tasks (e.g., classification on ImageNet). In others, pre-trained networks are also used to extract audio features prior to semantic embedding. Without "transfer learning" through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low recall rates in \(speech \rightarrow image\) and \(image \rightarrow speech\) queries. By choosing appropriate neural architectures for the encoders in the speech and image branches and by using large datasets, one can obtain competitive recall rates without relying on any pre-trained initialization or feature extraction: \((speech, image)\) semantic alignment and \(speech \rightarrow image\) and \(image \rightarrow speech\) retrieval are canonical tasks worthy of independent investi
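The recall rates mentioned in the abstract are typically reported as Recall@K over a matrix of cross-modal similarity scores. The following is a minimal sketch of that metric, not the paper's evaluation code: the function names `recall_at_k` and `transpose`, the toy scores, and the assumption that query i matches candidate i are all illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score (stable for ties).
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap query and candidate roles (rows become columns)."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical scores: rows are speech queries, columns are images.
speech_to_image = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.3, 0.6],
]
# The image -> speech direction uses the same scores, transposed.
image_to_speech = transpose(speech_to_image)

print("speech -> image R@1:", recall_at_k(speech_to_image, 1))
print("image -> speech R@1:", recall_at_k(image_to_speech, 1))
```

Transposing the matrix reuses one set of pairwise scores for both query directions, which is why the two recall figures can differ even though they come from the same embeddings.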
Related papers
- See, Hear, And Read: Deep Aligned Representations (2017) 0.00
- Talk, Don't Write: A Study Of Direct Speech-based Image Retrieval (2021) 6.77
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018) 14.27
- Symbolic Inductive Bias For Visually Grounded Learning Of Spoken Language (2018) 5.24
- An Analysis Of Semantically-aligned Speech-text Embeddings (2022) 7.81
- Learning Modality-invariant Representations For Speech And Images (2017) 8.09
- Is Cross-modal Information Retrieval Possible Without Training? (2023) 0.00
- Hierarchy-based Image Embeddings For Semantic Image Retrieval (2018) 13.84