A Feature-space Multimodal Data Augmentation Technique For Text-video Retrieval
2022 · Alex Falcon, Giuseppe Serra, Oswald Lanz
Abstract
Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. Text-video retrieval methods, which find relevant videos by means of a natural language query, have received increased attention over the past few years. Data augmentation techniques were introduced to increase performance on unseen test examples by creating new training samples through semantics-preserving transformations, such as color space or geometric transformations on images. Yet these techniques are usually applied to raw data, which leads to more resource-demanding solutions and also requires that the raw data be shareable. This may not always be possible, e.g. because of copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large scale public dataset, EP
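To make the idea of mixing in feature space concrete, here is a minimal sketch of one way it could look. The abstract does not state the mixing rule, so everything specific here is an assumption: the convex combination with a Beta-sampled coefficient is borrowed from mixup-style augmentation, a shared label stands in for "semantically similar", and the names `mix` and `augment` are hypothetical. It should not be read as the authors' method.

```python
import random


def mix(a, b, lam):
    """Convex combination of two feature vectors (assumed mixing rule)."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def augment(video_feats, text_feats, labels, rng, alpha=1.0):
    """Create one new (video, caption) feature pair per sample.

    Each sample is mixed with a randomly chosen partner that shares its
    label, which is used here as a stand-in for semantic similarity.
    The same coefficient is applied to both modalities so that the new
    video and the new caption stay aligned with each other.
    """
    new_videos, new_texts = [], []
    for i, label in enumerate(labels):
        # Candidate partners: other samples with the same label.
        partners = [j for j, other in enumerate(labels)
                    if other == label and j != i]
        # With no partner available, the sample is mixed with itself,
        # which leaves it unchanged up to rounding.
        j = rng.choice(partners) if partners else i
        lam = rng.betavariate(alpha, alpha)
        new_videos.append(mix(video_feats[i], video_feats[j], lam))
        new_texts.append(mix(text_feats[i], text_feats[j], lam))
    return new_videos, new_texts


if __name__ == "__main__":
    rng = random.Random(0)
    videos = [[0.0, 0.0], [2.0, 4.0], [10.0, 10.0]]  # precomputed video features
    texts = [[1.0], [3.0], [7.0]]                    # precomputed caption features
    labels = ["cut", "cut", "wash"]                  # hypothetical semantic classes
    mixed_videos, mixed_texts = augment(videos, texts, labels, rng)
    print(mixed_videos, mixed_texts)
```

Because the sketch only touches precomputed feature vectors, no raw video or text is needed at augmentation time, which is the property the abstract highlights.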
Related papers
- Paired Cross-modal Data Augmentation For Fine-grained Image-to-text Retrieval (2022) 8.09
- DREAM: Improving Video-text Retrieval Through Relevance-based Augmentation Using Large Foundation Models (2024) 2.26
- Feature Re-learning With Data Augmentation For Video Relevance Prediction (2020) 6.34
- Bridging Information Asymmetry In Text-video Retrieval: A Data-centric Approach (2024) 0.00
- MixGen: A New Multi-modal Data Augmentation (2022) 14.47
- Learning Audio-video Modalities From Image Captions (2022) 12.54
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020) 15.57
- Multi-modal Transformer For Video Retrieval (2020) 19.47