Stacked Convolutional Deep Encoding Network For Video-text Retrieval
2020 · Rui Zhao, Kecheng Zheng, Zheng-Jun Zha
Abstract
The dominant approaches to cross-modal video-text retrieval learn a joint embedding space in which cross-modal similarity is measured. However, these methods rarely explore long-range dependencies among video frames or textual words, which leaves textual and visual details insufficiently captured. In this paper, we propose a stacked convolutional deep encoding network for video-text retrieval that simultaneously encodes long-range and short-range dependencies in videos and texts. Specifically, a multi-scale dilated convolutional (MSDC) block encodes short-range temporal cues between video frames or text words by using convolutional layers with different kernel sizes and dilation rates. A stacked structure expands the receptive field by repeatedly applying the MSDC block, which further captures the long-range relations between these cues. Moreover, to obtain more robust textual representations, we fully utilize the p
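To make the receptive-field argument in the abstract concrete, here is a minimal sketch of how stacking dilated convolutions widens temporal coverage. It is not the authors' implementation: it assumes stride-1 1-D convolutions, assumes the branches of a block run in parallel (so the widest branch sets the block's span), and uses hypothetical (kernel size, dilation) pairs, since the abstract does not give the actual settings. All function names are illustrative.

```python
from typing import Sequence, Tuple

# (kernel_size, dilation) for one convolutional branch; values are hypothetical.
Branch = Tuple[int, int]


def branch_span(kernel_size: int, dilation: int) -> int:
    """Time steps covered by a single stride-1 dilated 1-D convolution."""
    return (kernel_size - 1) * dilation + 1


def block_receptive_field(branches: Sequence[Branch]) -> int:
    """Span of one multi-scale block, assuming parallel branches:
    the widest branch determines how far the block can see."""
    return max(branch_span(k, d) for k, d in branches)


def stacked_receptive_field(branches: Sequence[Branch], num_blocks: int) -> int:
    """Receptive field after stacking `num_blocks` identical blocks.
    Each stride-1 block adds (span - 1) time steps to the field."""
    rf = 1
    for _ in range(num_blocks):
        rf += block_receptive_field(branches) - 1
    return rf


# Illustrative branch configuration (an assumption, not taken from the paper).
example_branches = [(3, 1), (3, 2), (3, 4)]

for depth in (1, 2, 3):
    print(depth, stacked_receptive_field(example_branches, depth))
```

Under these assumptions a single block covers 9 time steps and each additional block adds 8 more, which is the sense in which repeating the block lets short-range cues be related over longer ranges.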
Related papers
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021) 0.00
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021) 0.00
- MDMMT-2: Multidomain Multimodal Transformer For Video Retrieval, One More Step Towards Generalization (2022) 0.00
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026) 1.24
- Deep Semantic Multimodal Hashing Network For Scalable Image-text And Video-text Retrievals (2019) 14.43
- Unifying Latent And Lexicon Representations For Effective Video-text Retrieval (2024) 0.00
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026) 0.00