Multimodal Transformer Networks For End-to-end Video-grounded Dialogue Systems
2019 · Hung Le, Doyen Sahoo, Nancy F. Chen, et al.
Abstract
Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on the visual and audio aspects of a given video, is significantly more challenging than developing traditional image- or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective at capturing complex long-term dependencies (such as those in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure to simulate token-level decoding
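As a rough illustration of the query-aware attention idea mentioned in the abstract, the sketch below lets each query-token vector attend over per-frame video features using generic scaled dot-product attention. It is not the authors' MTN implementation: the auto-encoder component is omitted, and the function names, toy vectors, and feature size are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_aware_attention(query_vecs, video_feats):
    """Hypothetical helper: for each query token, return a weighted sum of frame features.

    Weights are scaled dot-product scores between the query token and each
    frame, normalized with softmax (generic attention, not the paper's exact layer).
    """
    dim = len(video_feats[0])
    scale = math.sqrt(dim)
    attended, all_weights = [], []
    for q in query_vecs:
        # Similarity of this query token to every video frame.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / scale for f in video_feats]
        weights = softmax(scores)
        # Query-aware video feature: frames mixed according to the weights.
        context = [sum(w * f[d] for w, f in zip(weights, video_feats)) for d in range(dim)]
        attended.append(context)
        all_weights.append(weights)
    return attended, all_weights


# Toy inputs (made up): 2 query tokens, 3 video frames, feature size 4.
query = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
frames = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
attended, weights = query_aware_attention(query, frames)
```

Each row of `weights` sums to one, and in this toy setup each query token places its largest weight on the frame whose features align with it, so the returned features are conditioned on the query rather than being a fixed summary of the video.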
Related papers
- TMT: A Transformer-based Modal Translator For Improving Multimodal Sequence Representations In Audio Visual Scene-aware Dialog (2020)
- DSTC8-AVSD: Multimodal Semantic Transformer Network With Retrieval Style Word Generator (2020)
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024)
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025)
- Mechanisms Of Multimodal Synchronization: Insights From Decoder-based Video-text-to-speech Synthesis (2024)
- VX2TEXT: End-to-end Learning Of Video-based Text Generation From Multimodal Inputs (2021)
- A Better Use Of Audio-visual Cues: Dense Video Captioning With Bi-modal Transformer (2020)
- End-to-end Generative Pretraining For Multimodal Video Captioning (2022)