Leveraging Generative Language Models For Weakly Supervised Sentence Component Analysis In Video-language Joint Learning
2023 · Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Rahul Pratap Singh, et al.
Abstract
A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that the current models do not achieve a comprehensive understanding of the textual data during the training for the target downstream tasks. Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks. Through rigorous quantitative analysis, our proposed method exhibits significant improvement across several video-language tasks. In particular, our approach notably
Related papers
- Towards Holistic Language-video Representation: The Language Model-enhanced Msr-video To Text Dataset (2024)
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)
- Learning Joint Representations Of Videos And Sentences With Web Image Search (2016)
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016)
- Context-enhanced Video Moment Retrieval With Large Language Models (2024)
- Litevl: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022)
- Vill-e: Video LLM Embeddings For Retrieval (2026)
- Distilling Vision-language Models On Millions Of Videos (2024)