SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
2024 · Yichen Lu, Jiaqi Song, Xuankai Chang, et al.
Abstract
In this work, we present SynesLM, a unified model that can perform three multimodal language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and visual-aided speech/machine translation (VST/VMT). Unlike previous research that focused on lip motion as the visual cue for speech signals, our work explores more general visual information within entire frames, such as objects and actions. Additionally, we use synthetic image data to enhance the correlation between image and speech data. We benchmark SynesLM on the How2 dataset, demonstrating performance on par with state-of-the-art (SOTA) models dedicated to AV-ASR while maintaining our multitasking framework. Remarkably, for zero-shot AV-ASR, SynesLM achieves SOTA performance, lowering the Word Error Rate (WER) from 43.4% to 39.4% on the VisSpeech dataset. Furthermore, our results in VST and VMT outperform previous results, improving the BLEU score from 37.2 to 43.5 for VST and from 54.4 to 54.8 for VMT.
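To make the reported numbers easier to interpret, the following is a minimal sketch of the standard WER definition (word-level edit distance divided by the reference length) together with the relative change implied by the figures above. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_change` and the example sentences are hypothetical, and any text normalization applied in the paper's scoring is not described here and is therefore omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_change(old: float, new: float) -> float:
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old


if __name__ == "__main__":
    # Illustrative sentences only; they are not taken from How2 or VisSpeech.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Figures quoted in the abstract above.
    print(f"WER, VisSpeech: {relative_change(43.4, 39.4):+.1%}")
    print(f"BLEU, VST:      {relative_change(37.2, 43.5):+.1%}")
    print(f"BLEU, VMT:      {relative_change(54.4, 54.8):+.1%}")
```

Under this reading, the zero-shot WER drop from 43.4% to 39.4% is 4.0 points absolute, or roughly 9.2% relative.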
Related papers
- SLM-S2ST: A Multimodal Language Model for Direct Speech-to-speech Translation (2025) · 0.00
- Multilingual Audio-visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer (2024) · 8.60
- SyncVSR: Data-efficient Visual Speech Recognition with End-to-end Crossmodal Audio Token Synchronization (2024) · 8.35
- SynthVSR: Scaling Up Visual Speech Recognition with Synthetic Supervision (2023) · 9.76
- VoxtLM: Unified Decoder-only Models for Consolidating Speech Recognition/synthesis and Speech/text Continuation Tasks (2023) · 0.00
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models (2025) · 2.26
- Zero-resource Speech Translation and Recognition with LLMs (2024) · 3.58
- UniSyn: An End-to-end Unified Model for Text-to-speech and Singing Voice Synthesis (2022) · 0.00