Unifying Latent And Lexicon Representations For Effective Video-text Retrieval
2024 · Haowei Liu, Yaya Shi, Haiyang Xu, et al.
Abstract
In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval. Specifically, we map videos and texts into a pre-defined lexicon space, where each dimension corresponds to a semantic concept. A two-stage semantics grounding approach is proposed to activate semantically relevant dimensions and suppress irrelevant dimensions. The learned lexicon representations can thus reflect fine-grained semantics of videos and texts. Furthermore, to leverage the complementarity between latent and lexicon representations, we propose a unified learning scheme to facilitate mutual learnin
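The abstract describes two complementary views of each video and text: a global latent embedding from a dual encoder, and a lexicon representation in which each dimension is a semantic concept. A minimal sketch of how such scores might be combined at retrieval time follows. It is not the UNIFY implementation: the weighted-sum fusion, the `alpha` parameter, the function names, and the toy vectors are all assumptions made for illustration, and the abstract does not state how the paper fuses the two similarities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lexicon_score(text_lex, video_lex):
    """Dot product of two sparse lexicon vectors stored as {concept: weight}.

    Only concepts activated on both sides contribute to the score.
    """
    return sum(w * video_lex.get(concept, 0.0) for concept, w in text_lex.items())

def combined_score(text, video, alpha=0.5):
    """Hypothetical fusion: weighted sum of latent and lexicon similarity."""
    latent = cosine(text["latent"], video["latent"])
    lexicon = lexicon_score(text["lexicon"], video["lexicon"])
    return alpha * latent + (1.0 - alpha) * lexicon

def rank_videos(text, videos, alpha=0.5):
    """Return video ids sorted from best to worst match for the text query."""
    return sorted(
        videos,
        key=lambda vid: combined_score(text, videos[vid], alpha),
        reverse=True,
    )

# Toy data (invented): two videos share the same latent embedding,
# so only the lexicon view can tell them apart.
query = {"latent": [1.0, 0.0], "lexicon": {"dog": 1.0, "frisbee": 0.8}}
videos = {
    "dog_frisbee": {"latent": [0.8, 0.6],
                    "lexicon": {"dog": 0.9, "frisbee": 0.7, "park": 0.4}},
    "dog_sleeping": {"latent": [0.8, 0.6],
                     "lexicon": {"dog": 0.9, "sofa": 0.6}},
    "cooking": {"latent": [0.0, 1.0], "lexicon": {"kitchen": 0.9}},
}

print(rank_videos(query, videos))
```

In this toy setup the two dog videos are indistinguishable in the latent space, and the "frisbee" dimension of the lexicon view breaks the tie, which is the kind of fine-grained distinction the abstract attributes to lexicon representations.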
Related papers
- Dual Encoding For Video Retrieval By Text (2020)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)
- Lat: Latent Translation With Cycle-consistency For Video-text Retrieval (2022)
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021)
- Text-video Retrieval With Global-local Semantic Consistent Learning (2024)
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023)
- Univse: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019)