Abstract

Multi-media communications facilitate global interaction among people. However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech. This lack of research is mainly due to the absence of datasets containing visual speech and translated text pairs. In this paper, we present AVMuST-TED, the first dataset for Audio-Visual Multilingual Speech Translation, derived from TED talks. Nonetheless, visual speech is not as distinguishable as audio speech, making it difficult to develop a mapping from source speech phonemes to the target language text. To address this issue, we propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks. To further minimize the cross-modality gap

Authors

(none)

Tags

  • Speech Translation
  • Speech Recognition
  • Text-to-Speech

Stats

  • citations: 13
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 8.60
  • arxiv key: cheng2023mixspeech

Related papers