Multi-modal Dense Video Captioning
2020 · Vladimir Iashin, Esa Rahtu
Abstract
Dense video captioning is the task of localizing interesting events in an untrimmed video and producing a textual description (caption) for each localized event. Most previous work in dense video captioning is based solely on visual information and completely ignores the audio track. However, audio, and speech in particular, are vital cues for a human observer in understanding an environment. In this paper, we present a new dense video captioning approach that is able to utilize any number of modalities for event description. Specifically, we show how audio and speech modalities may improve a dense video captioning model. We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside video frames and the corresponding audio track. We formulate the captioning task as a machine translation problem and utilize the recently proposed Transformer architecture to convert
Related papers
- A Better Use Of Audio-visual Cues: Dense Video Captioning With Bi-modal Transformer (2020) 10.61
- Watch, Listen, And Describe: Globally And Locally Aligned Cross-modal Attentions For Video Captioning (2018) 12.87
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022) 12.10
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025) 0.00
- Learning Audio-video Modalities From Image Captions (2022) 12.54
- Classifier-guided Captioning Across Modalities (2025) 0.00
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021) 0.00
- VAST: A Vision-audio-subtitle-text Omni-modality Foundation Model And Dataset (2023) 14.55