CTC-GMM: CTC Guided Modality Matching For Fast And Accurate Streaming Speech Translation
2024 · Rui Zhao, Jinyu Li, Ruchao Fan, et al.
Abstract
Streaming speech translation (ST) models can achieve high accuracy and low latency when they are trained on vast amounts of paired data: audio in the source language and written text in the target language. However, the target-language text labels are often pseudo labels, because manual ST data labeling is prohibitively expensive. In this paper, we introduce Connectionist Temporal Classification guided modality matching (CTC-GMM), a method that enhances the streaming ST model by leveraging extensive machine translation (MT) text data. The technique uses CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, which allows us to use matched {source-target} language text pairs from the MT corpora to further refine the streaming ST model. Our evaluations on FLEURS and CoVoST2 show that CTC-GMM improves translation accuracy by a relative 13.9% and 6.4%, respectively, while also boosting decoding sp
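The abstract does not say how the CTC-based compression is carried out. As a rough illustration only, one common way to shorten a frame sequence with CTC is to take the greedy per-frame label, merge consecutive frames that share a label by averaging them, and drop blank runs. The sketch below assumes that scheme; the function name `ctc_compress`, the averaging rule, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import numpy as np

def ctc_compress(frames, log_probs, blank=0):
    """Collapse frame-level embeddings into one vector per non-blank CTC run.

    frames:    (T, D) array of encoder outputs (hypothetical input)
    log_probs: (T, V) array of per-frame CTC log-probabilities
    Returns a (U, D) array, where U is the number of non-blank label runs.
    """
    labels = log_probs.argmax(axis=1)  # greedy CTC label for each frame
    compressed = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run when the label changes or the sequence ends.
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != blank:
                # Assumption: average the frames in the run; blank runs are dropped.
                compressed.append(frames[start:t].mean(axis=0))
            start = t
    if not compressed:
        return np.empty((0, frames.shape[1]))
    return np.stack(compressed)

# Toy example: 8 frames, 2-dim embeddings, vocabulary {blank, 1, 2}.
toy_labels = np.array([0, 1, 1, 0, 2, 2, 2, 0])
toy_frames = np.arange(16, dtype=float).reshape(8, 2)
toy_log_probs = np.log(np.eye(3)[toy_labels] + 1e-9)

compressed = ctc_compress(toy_frames, toy_log_probs)
print(compressed.shape)  # 8 frames reduced to 2 vectors, one per non-blank run
```

Under this assumed scheme the compressed sequence has one vector per emitted token, so its length is comparable to that of a text sequence, which is the property the abstract relies on when it pairs speech with MT text.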
Related papers
- Bridging The Gaps Of Both Modality And Language: Synchronous Bilingual CTC For Speech Translation And Speech Recognition (2023)
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)
- Efficient CTC Regularization Via Coarse Labels For End-to-end Speech Translation (2023)
- Contrastive Feedback Mechanism For Simultaneous Speech Translation (2024)
- Long-form End-to-end Speech Translation Via Latent Alignment Segmentation (2023)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- STEMM: Self-learning With Speech-text Manifold Mixup For Speech Translation (2022)
- Bridging The Modality Gap For Speech-to-text Translation (2020)