Continual Text-to-video Retrieval With Frame Fusion And Task-aware Routing
2025 · Zecheng Zhao, Zhi Chen, Zi Huang, et al.
Abstract
Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across
Related papers
- TVPR: Text-to-video Person Retrieval And A New Benchmark (2023) 2.26
- Tokenbinder: Text-video Retrieval With One-to-many Alignment Paradigm (2024) 4.52
- Mv-adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) 9.76
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- Tempme: Video Temporal Token Merging For Efficient Text-video Retrieval (2024) 2.86
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022) 7.50
- Teachclip: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023) 0.00