MixSpeech: Cross-modality Self-learning With Audio-visual Stream Mixup For Visual Speech Translation And Recognition
2023 · Xize Cheng, Linjun Li, Tao Jin, et al.
Abstract
Multi-media communications facilitate global interaction among people. However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech. This lack of research is mainly due to the absence of datasets containing visual speech and translated text pairs. In this paper, we present AVMuST-TED, the first dataset for Audio-Visual Multilingual Speech Translation, derived from TED talks. Nonetheless, visual speech is not as distinguishable as audio speech, making it difficult to develop a mapping from source speech phonemes to the target language text. To address this issue, we propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks. To further minimize the cross-modality gap
Related papers
- Efficient Audiovisual Speech Processing Via MUTUD: Multimodal Training And Unimodal Deployment (2025) 0.00
- AudioSetMix: Enhancing Audio-language Datasets With LLM-assisted Augmentations (2024) 0.00
- STEMM: Self-learning With Speech-text Manifold Mixup For Speech Translation (2022) 11.58
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024) 0.00
- AV2AV: Direct Audio-visual Speech To Audio-visual Speech Translation With Unified Audio-visual Speech Representation (2023) 6.77
- Joint Training And Decoding For Multilingual End-to-end Simultaneous Speech Translation (2025) 0.95
- SynesLM: A Unified Approach For Audio-visual Speech Recognition And Translation Via Language Model And Synthetic Data (2024) 0.00
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024) 5.24