Show And Speak: Directly Synthesize Spoken Description Of Images
2020 · Xinsheng Wang, Siyuan Feng, Jihua Zhu, et al.
Abstract
This paper proposes the show and speak (SAS) model, which, for the first time, directly synthesizes spoken descriptions of images without passing through text or phonemes. SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech describing that image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark Flickr8k show that SAS can synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions of images while bypassing text and phonemes is feasible.
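The data flow described in the abstract (image in, spectrogram frames out) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ToyImageToSpectrogram`, the sizes `N_MELS`, `FEAT_DIM`, and `HIDDEN`, the mean-pooled region features, the random untrained weights, and the fixed frame count are all assumptions made for illustration. The final waveform step (WaveNet in the abstract) is not sketched.

```python
import math
import random

# All sizes below are illustrative assumptions, not values from the paper.
N_MELS = 80      # mel bins per spectrogram frame
FEAT_DIM = 16    # size of each image-region feature vector
HIDDEN = 32      # hidden size shared by the toy encoder and decoder


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyImageToSpectrogram:
    """Hypothetical encoder-decoder: image features in, mel frames out."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_enc = make_matrix(HIDDEN, FEAT_DIM, rng)
        self.w_state = make_matrix(HIDDEN, HIDDEN, rng)
        self.w_prev = make_matrix(HIDDEN, N_MELS, rng)
        self.w_out = make_matrix(N_MELS, HIDDEN, rng)

    def encode(self, region_features):
        """Project each region feature vector, then mean-pool into one context vector."""
        projected = [matvec(self.w_enc, f) for f in region_features]
        return [sum(col) / len(projected) for col in zip(*projected)]

    def decode(self, context, n_frames):
        """Emit n_frames mel frames autoregressively, conditioned on the image context."""
        state = list(context)
        prev_frame = [0.0] * N_MELS  # all-zero start frame
        frames = []
        for _ in range(n_frames):
            recurrent = matvec(self.w_state, state)
            feedback = matvec(self.w_prev, prev_frame)
            state = [math.tanh(r + f + c)
                     for r, f, c in zip(recurrent, feedback, context)]
            prev_frame = matvec(self.w_out, state)
            frames.append(prev_frame)
        return frames


# Usage: four made-up region feature vectors produce a 5 x 80 "spectrogram".
rng = random.Random(1)
regions = [[rng.random() for _ in range(FEAT_DIM)] for _ in range(4)]
model = ToyImageToSpectrogram()
spectrogram = model.decode(model.encode(regions), n_frames=5)
print(len(spectrogram), len(spectrogram[0]))  # 5 80
```

A trained system would learn these weights from paired images and speech and would typically decide when to stop rather than take a fixed frame count; the sketch only shows that no text or phoneme representation appears anywhere between the image features and the predicted frames.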
Related papers
- Seeing What You Say: Expressive Image Generation From Speech (2025)
- Text-free Image-to-speech Synthesis Using Learned Segmental Units (2020)
- S2IGAN: Speech-to-image Generation Via Adversarial Learning (2020)
- Video-driven Speech Reconstruction Using Generative Adversarial Networks (2019)
- Syneslm: A Unified Approach For Audio-visual Speech Recognition And Translation Via Language Model And Synthetic Data (2024)
- Transcription-enriched Joint Embeddings For Spoken Descriptions Of Images And Videos (2020)
- Language Learning Using Speech To Image Retrieval (2019)
- Improved Speech Reconstruction From Silent Video (2017)