MMTrail: A Multimodal Trailer Video Dataset With Language And Music Descriptions
2024 · Xiaowei Chi, Yatian Wang, Aosong Cheng, et al.
Abstract
Massive multi-modality datasets play a significant role in the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames and treat audio as weakly related information. They usually overlook the inherent audio-visual correlation, which leads to monotonous annotation within each modality rather than comprehensive and precise descriptions. This omission makes many cross-modality studies difficult. To fill this gap, we present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions, and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, trailers have two main advantages: (1) the topics are diverse, and the content is of various types, e.g., film, news, and gaming. (