i-Code V2: An Autoregressive Generation Framework Over Vision, Language, And Speech Data
2023 · Ziyi Yang, Mahmoud Khademi, Yichong Xu, et al.
Abstract
The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models, which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to flexibly project combinations of modalities into a shared representational space. Language tokens are then generated from these representations by an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- an
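The abstract does not spell out the text completion objective. As a rough illustration only, the sketch below assumes it means splitting a text sequence into a prefix, given to the encoder together with any other available modalities, and a remainder that the autoregressive decoder must generate. The function name `make_completion_example`, the `split_ratio` parameter, and the dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional, Sequence


def make_completion_example(
    text_tokens: Sequence[str],
    other_inputs: Optional[Dict[str, object]] = None,
    split_ratio: float = 0.5,
) -> Dict[str, object]:
    """Build one hypothetical text-completion training example.

    Assumption (not from the source): the text is cut into a prefix that
    joins the encoder inputs and a suffix that serves as the decoder target.
    """
    if len(text_tokens) < 2:
        raise ValueError("need at least two tokens to form a prefix and a target")

    # Clamp the cut point so both the prefix and the target are non-empty.
    cut = int(len(text_tokens) * split_ratio)
    cut = max(1, min(len(text_tokens) - 1, cut))

    # Any combination of non-text modalities (vision, speech) may be present,
    # including none at all for single-modality text data.
    encoder_inputs = dict(other_inputs or {})
    encoder_inputs["language"] = list(text_tokens[:cut])

    return {
        "encoder_inputs": encoder_inputs,
        "decoder_target": list(text_tokens[cut:]),
    }


# Dual-modality example: an image (placeholder string) plus its caption.
caption = "a dog catches a frisbee in the park".split()
example = make_completion_example(caption, {"vision": "<image features>"})

# Single-modality example: text only.
text_only = make_completion_example(["hello", "world"])
```

Under this reading, the same objective applies whether the text is paired with vision, with speech, or with nothing, which is one way such an objective could cover arbitrary combinations of modalities.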
Related papers
- VX2TEXT: End-to-end Learning Of Video-based Text Generation From Multimodal Inputs (2021)
- I Hear Your True Colors: Image Guided Audio Generation (2022)
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022)
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders (2022)
- VioLA: Unified Codec Language Models For Speech Recognition, Synthesis, And Translation (2023)
- Video-driven Speech Reconstruction Using Generative Adversarial Networks (2019)
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024)