Text-to-video: A Two-stage Framework For Zero-shot Identity-agnostic Talking-head Generation
2023 · Zhichao Wang, Mengyu Dai, Keld Lundgaard
Abstract
The advent of ChatGPT has introduced innovative methods for information gathering and analysis. However, the information provided by ChatGPT is limited to text, and the visualization of this information remains constrained. Previous research has explored zero-shot text-to-video (TTV) approaches to transform text into videos. However, these methods lacked control over the identity of the generated audio, i.e., they were not identity-agnostic, which hindered their effectiveness. To address this limitation, we propose a novel two-stage framework for person-agnostic video cloning, specifically focusing on TTV generation. In the first stage, we leverage pretrained zero-shot models to achieve text-to-speech (TTS) conversion. In the second stage, an audio-driven talking head generation method is employed to produce compelling videos from the audio generated in the first stage. This paper presents a comparative analysis of different TTS and audio-driven talking head generation methods, identifying the mo
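The two-stage structure described in the abstract can be sketched as a simple composition of two interchangeable stages. This is a minimal illustration, not the authors' implementation: the `Audio` and `Video` containers, the `text_to_video` function, and both stub stages are hypothetical names introduced here, and the choice of a reference voice clip and a reference face image as the conditioning inputs is an assumption about how zero-shot models are typically driven.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Audio:
    samples: List[float]
    sample_rate: int


@dataclass
class Video:
    frames: List[str]  # placeholder frame labels, not real images
    fps: int
    audio: Audio


# Hypothetical stage interfaces (assumptions, not from the paper):
#   stage 1: (text, reference voice clip) -> synthesized speech
#   stage 2: (speech, reference face image path) -> talking-head video
TTSStage = Callable[[str, Audio], Audio]
TalkingHeadStage = Callable[[Audio, str], Video]


def text_to_video(
    text: str,
    voice_ref: Audio,
    face_image: str,
    tts: TTSStage,
    talking_head: TalkingHeadStage,
) -> Video:
    """Chain the two stages: text -> speech -> video."""
    speech = tts(text, voice_ref)            # stage 1: zero-shot TTS
    return talking_head(speech, face_image)  # stage 2: audio-driven video


def dummy_tts(text: str, voice_ref: Audio) -> Audio:
    """Stand-in for a pretrained TTS model: 0.1 s of silence per character."""
    n = len(text) * (voice_ref.sample_rate // 10)
    return Audio([0.0] * n, voice_ref.sample_rate)


def dummy_talking_head(speech: Audio, face_image: str, fps: int = 25) -> Video:
    """Stand-in for a talking-head model: one labeled frame per time step."""
    n_frames = len(speech.samples) * fps // speech.sample_rate
    frames = [f"{face_image}#frame{i}" for i in range(n_frames)]
    return Video(frames, fps, speech)


if __name__ == "__main__":
    ref = Audio([0.0] * 16000, 16000)
    video = text_to_video("test", ref, "face.png", dummy_tts, dummy_talking_head)
    print(len(video.frames), "frames at", video.fps, "fps")
```

Because the second stage consumes only the audio produced by the first, either stage can be swapped for a different model without touching the other, which is what makes a comparison across TTS and talking-head methods straightforward to set up.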
Related papers
- Text-driven Talking Face Synthesis By Reprogramming Audio-driven Models (2023)
- TalkVerse: Democratizing Minute-long Audio-driven Video Generation (2025)
- MaskGCT: Zero-shot Text-to-speech With Masked Generative Codec Transformer (2024)
- TransFace: Unit-based Audio-visual Speech Synthesizer For Talking Head Translation (2023)
- MobileSpeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024)
- YourTTS: Towards Zero-shot Multi-speaker TTS And Zero-shot Voice Conversion For Everyone (2021)
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025)
- A Unified Compression Framework For Efficient Speech-driven Talking-face Generation (2023)