VAST: A Vision-audio-subtitle-text Omni-modality Foundation Model And Dataset
2023 · Sihan Chen, Handong Li, Qunbo Wang, et al.
Abstract
Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we establish connections between the multi-modality video tracks (Vision, Audio, and Subtitle) and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vis
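The caption-integration step described in the abstract (vision caption, audio caption, subtitle, and an instructional prompt combined and passed to an LLM) can be pictured with a minimal sketch. Everything named here is an assumption for illustration: the function `build_omni_prompt`, the field labels, and the wording of `INSTRUCTION` do not come from the paper, and the sketch only assembles the prompt text without calling any model.

```python
# Hypothetical instruction text; the abstract does not give the actual prompt wording.
INSTRUCTION = (
    "Combine the following descriptions of one video clip into a single "
    "caption that covers what is seen, heard, and said."
)


def build_omni_prompt(vision_caption, audio_caption, subtitle="", instruction=INSTRUCTION):
    """Assemble the text that would be sent to an off-the-shelf LLM.

    All argument names and field labels are illustrative assumptions.
    """
    parts = [
        instruction,
        f"Vision caption: {vision_caption.strip()}",
        f"Audio caption: {audio_caption.strip()}",
    ]
    # Assumption: clips without speech have an empty subtitle, so the field is skipped.
    if subtitle.strip():
        parts.append(f"Subtitle: {subtitle.strip()}")
    parts.append("Omni-modality caption:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_omni_prompt(
            "A man plays guitar on stage.",
            "Guitar music and applause.",
            "Thank you all for coming!",
        )
    )
```

The returned string would then be passed to whichever LLM is used; that call is left out because the abstract does not name the model or its interface.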
Related papers
- VALOR: Vision-audio-language Omni-perception Pretraining Model And Dataset (2023) · 10.61
- Effectively Obtaining Acoustic, Visual And Textual Data From Videos (2025) · 0.00
- Sound-vecaps: Improving Audio Generation With Visual Enhanced Captions (2024) · 7.16
- Learning Audio-video Modalities From Image Captions (2022) · 12.54
- V-SAT: Video Subtitle Annotation Tool (2025) · 0.00
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024) · 0.00
- A Better Use Of Audio-visual Cues: Dense Video Captioning With Bi-modal Transformer (2020) · 10.61
- Audiosetcaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024) · 13.44