Temporal Working Memory: Query-guided Segment Refinement For Enhanced Multimodal Understanding
2025 · Xingjian Diao, Chunhui Zhang, Weiyi Wu, et al.
Abstract
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This
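The query-guided selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the query and each temporal segment have already been embedded as plain vectors, and it substitutes cosine similarity for the paper's query-guided attention. The names `cosine` and `select_segments` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_segments(query_vec, segment_vecs, k):
    """Return the indices of the k segments most similar to the query.

    Assumption: similarity to the query stands in for "most informative".
    Indices are returned in temporal order so the retained segments still
    form a valid sequence for a downstream model.
    """
    scores = [cosine(query_vec, seg) for seg in segment_vecs]
    # Rank segment indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:max(k, 0)]
    # Restore temporal order among the retained segments.
    return sorted(top)


# Toy example: four segments in temporal order, one query vector.
query = [1.0, 0.0]
segments = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
kept = select_segments(query, segments, k=2)
```

Under these assumptions, only the retained segments would be passed on to the model, which is one way to read the abstract's point about spending limited capacity on the most relevant content.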
Related papers
- SMART: Shot-aware Multimodal Video Moment Retrieval With Audio-enhanced MLLM (2025) · 0.00
- Efficient Audiovisual Speech Processing Via MUTUD: Multimodal Training And Unimodal Deployment (2025) · 0.00
- WDMIR: Wavelet-driven Multimodal Intent Recognition (2025) · 2.26
- Query-centric Audio-visual Cognition Network For Moment Retrieval, Segmentation And Step-captioning (2024) · 3.58
- Temporal Film: Capturing Long-range Sequence Dependencies With Feature-wise Modulations (2019) · 0.00
- Multi-modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video (2022) · 12.10
- Watch, Listen, And Describe: Globally And Locally Aligned Cross-modal Attentions For Video Captioning (2018) · 12.87
- Multi-resolution Audio-visual Feature Fusion For Temporal Action Localization (2023) · 0.00