Deep Triplet Neural Networks With Cluster-cca For Audio-visual Cross-modal Retrieval
2019 · Donghuo Zeng, Yi Yu, Keizo Oyama
Abstract
Cross-modal retrieval aims to retrieve data in one modality using a query in another modality, and it has been an active research topic in multimedia, information retrieval, computer vision, and databases. Most existing work focuses on cross-modal retrieval between text and image, text and video, and lyrics and audio. Little research addresses cross-modal retrieval between audio and video, owing to the limited availability of paired audio-video datasets and semantic information. The main challenge of the audio-visual cross-modal retrieval task is learning joint embeddings in a shared subspace for computing similarity across modalities, where the new representations are generated to maximize the correlation between the audio and visual modality spaces. In this work, we propose a novel deep triplet neural network with cluster canonical correlation analysis (TNN-C-CCA), an end-to-end supervised learning architecture with an audio branch and a video branch. We not only consider the matching
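The triplet idea named in the abstract can be illustrated with a generic sketch. This is not the authors' TNN-C-CCA: the excerpt does not specify their loss, distance measure, or how the cluster-CCA term enters, so the cosine distance, the margin value of 0.2, and all function and variable names below are assumptions chosen only for illustration. The sketch shows a hinge-style triplet loss that is zero when an audio embedding sits closer to its matching video embedding than to a mismatched one by at least the margin.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: penalize when the positive is not closer
    to the anchor than the negative by at least `margin` (assumed value)."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings in a shared audio-visual space.
audio_anchor = [1.0, 0.0]    # embedding of an audio clip
video_match = [1.0, 0.0]     # embedding of its paired video
video_mismatch = [0.0, 1.0]  # embedding of an unrelated video

well_ordered = triplet_loss(audio_anchor, video_match, video_mismatch)
badly_ordered = triplet_loss(audio_anchor, video_mismatch, video_match)
```

Here `well_ordered` is zero because the matching video is already much closer to the audio anchor than the mismatched one, while `badly_ordered` is positive because the roles are swapped. In a trained model the embeddings would come from the audio and video branches rather than being written by hand.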
Related papers
- Learning Joint Embedding For Cross-modal Retrieval (2019) 5.84
- Audio-visual Embedding For Cross-modal Musicvideo Retrieval Through Supervised Deep CCA (2019) 11.93
- Variational Autoencoder With CCA For Audio-visual Cross-modal Retrieval (2021) 9.92
- End-to-end Cross-modality Retrieval With CCA Projections And Pairwise Ranking Loss (2017) 14.68
- Video And Audio Are Images: A Cross-modal Mixer For Original Data On Video-audio Retrieval (2023) 7.16
- Unsupervised Cross-modal Audio Representation Learning From Unstructured Multilingual Text (2020) 2.26
- Cross-modal Embeddings For Video And Audio Retrieval (2018) 11.08
- Multiscale Matching Driven By Cross-modal Similarity Consistency For Audio-text Retrieval (2024) 4.52