Text Is MASS: Modeling As Stochastic Embedding For Text-video Retrieval
2024 · Jiamian Wang, Guohao Sun, Pichao Wang, et al.
Abstract
The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may not be expressive enough to capture the video embedding and support retrieval. In this study, we propose a new stochastic text modeling method, T-MASS, in which text is modeled as a stochastic embedding to enrich the text embedding with a flexible and resilient semantic range, yielding a text mass. Specifically, we introduce a similarity-aware radius module to adapt the scale of the text mass to the given text-video pairs. In addition, we design a support text regularization to further control the text mass during training. The inference pipeline is
Related papers
- Prota: Probabilistic Token Aggregation For Text-video Retrieval (2024) · 4.52
- Stacked Convolutional Deep Encoding Network For Video-text Retrieval (2020) · 7.81
- Video-text Retrieval By Supervised Sparse Multi-grained Learning (2023) · 8.03
- Learning A Text-video Embedding From Incomplete And Heterogeneous Data (2018) · 4.18
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026) · 0.00
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021) · 15.43
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) · 15.46
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021) · 0.00