C3Net: Compound Conditioned ControlNet For Multimodal Content Generation
2023 · Juntao Zhang, Yuehuai Liu, Yu-Wing Tai, et al.
Abstract
We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditions from multi-modalities to the same semantic latent space using modality-specific encoders based on contrastive training. Then, it generates multimodal outputs based on the aligned latent space, whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. Correspondingly, with this system design, our model offers an improved solution for joint-modality generation through learning and explaining multimodal conditions instead of simply taking linear interpolations on the latent space. Meanwhile, as we align conditions to a unified latent space, C3Net only
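The abstract says the modality-specific encoders are aligned to a shared latent space through contrastive training, but it does not state the loss. As a rough illustration only, the sketch below uses a symmetric CLIP-style InfoNCE loss in NumPy. The loss form, the `temperature` value, the function names, and the random toy embeddings are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Assumed CLIP-style loss: row i of emb_a and row i of emb_b are a matched pair."""
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature            # pairwise cosine similarities, scaled
    idx = np.arange(len(a))                   # matched pairs sit on the diagonal
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

# Toy embeddings standing in for the outputs of two hypothetical modality encoders.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
audio_aligned = image_emb + 0.01 * rng.normal(size=(8, 16))  # nearly the same latent
audio_random = rng.normal(size=(8, 16))                      # unrelated latent

loss_aligned = symmetric_contrastive_loss(image_emb, audio_aligned)
loss_random = symmetric_contrastive_loss(image_emb, audio_random)
```

In this toy setup the loss is lower when paired embeddings from the two modalities already lie close together, which is the property a contrastively trained encoder is pushed toward.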
Related papers
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) 0.00
- Any-to-any Generation Via Composable Diffusion (2023) 7.16
- BandCondiNet: Parallel Transformers-based Conditional Popular Music Generation With Multi-view Features (2024) 0.00
- A Unified Neural Architecture For Instrumental Audio Tasks (2019) 0.00
- Towards Lightweight Controllable Audio Synthesis With Conditional Implicit Neural Representations (2021) 0.00
- M3D-GAN: Multi-modal Multi-domain Translation With Universal Attention (2019) 0.00
- Conditional Hybrid GAN For Sequence Generation (2020) 0.00
- Editing Music With Melody And Text: Using ControlNet For Diffusion Transformer (2024) 5.84