Video And Audio Are Images: A Cross-modal Mixer For Original Data On Video-audio Retrieval
2023 · Zichen Yuan, Qi Shen, Bingyi Zheng, et al.
Abstract
Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. In general, each modality has its own representation and semantic information, so features encoded with a dual-tower architecture tend to lie in separate latent spaces. This makes it difficult to establish semantic relationships between modalities and results in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval that consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt the cross-modal mixer and mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder-decoder architecture is applied to achieve a fuse-then-separate task in the pre-training phase. We feed masked fused representations into the encoder and reconstruct them with the decoder, ultimately separating the original data of two mo
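The fuse, mask, and separate data flow described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract does not specify how the mixer works, so `mix` simply concatenates patch tokens, and the helper names (`patchify`, `mix`, `random_mask`, `separate`), the patch size, and the mask ratio of 0.75 are hypothetical. No encoder or decoder is trained; the sketch only shows which tokens a model would see and what it would be asked to recover.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) array into flattened, non-overlapping patch x patch tiles."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    tiles = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)

def mix(video_tokens, audio_tokens):
    """Hypothetical mixer: concatenate both token sets along the sequence axis."""
    return np.concatenate([video_tokens, audio_tokens], axis=0)

def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; return them with their sorted indices."""
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep], keep

def separate(fused_tokens, n_video):
    """Split a fused sequence back into its video part and its audio part."""
    return fused_tokens[:n_video], fused_tokens[n_video:]

rng = np.random.default_rng(0)
frame = rng.random((32, 32))        # stand-in for one grayscale video frame
spectrogram = rng.random((32, 64))  # stand-in for an audio spectrogram

video_tokens = patchify(frame, 8)        # shape (16, 64)
audio_tokens = patchify(spectrogram, 8)  # shape (32, 64)
fused = mix(video_tokens, audio_tokens)  # shape (48, 64)

# Only the visible tokens would be fed to an encoder.
visible, keep = random_mask(fused, mask_ratio=0.75, rng=rng)

# A trained decoder would predict the full fused sequence from `visible`;
# here the ground-truth sequence stands in to show the separation targets.
video_target, audio_target = separate(fused, n_video=video_tokens.shape[0])
```

Treating the audio spectrogram as a second image lets both modalities share one patch vocabulary, which is what makes a single fused sequence possible in this sketch.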
Related papers
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025) 0.00
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021) 15.78
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018) 14.19
- Cross-modal Embeddings For Video And Audio Retrieval (2018) 11.08
- Variational Autoencoder With CCA For Audio-visual Cross-modal Retrieval (2021) 9.92
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Audio-visual Embedding For Cross-modal Musicvideo Retrieval Through Supervised Deep CCA (2019) 11.93
- Cross-modal Search Method Of Technology Video Based On Adversarial Learning And Feature Fusion (2022) 0.00