S2ST-Omni: Hierarchical Language-aware SpeechLLM Adaptation For Multilingual Speech-to-speech Translation
2025 · Yu Pan, Xiongfei Wu, Yuguang Yang, et al.
Abstract
Despite recent advances in speech-to-speech translation (S2ST), it remains difficult to achieve both high translation accuracy and practical flexibility. In this paper, we present S2ST-Omni, a compositional S2ST framework that integrates a high-accuracy speech-to-text translation (S2TT) frontend with a modular, plug-and-play text-to-speech (TTS) backend, enabling independent optimization of translation and synthesis. On the S2TT side, we introduce a hybrid adapter that follows a "local-then-global" strategy to bridge a pretrained Whisper encoder and a Qwen3 LLM, yielding a hierarchical acoustic-to-semantic abstraction. Building on this bridge, we further propose a hierarchical language-aware architecture that injects source-language information at two complementary levels. At the acoustic level, Language-Aware Dual-CTC operates on intermediate adapter features and employs FiLM-style feature modulation with a learnable gate, encouraging the model to learn language-specific but content-f
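The abstract mentions FiLM-style feature modulation with a learnable gate. The listing does not give the exact formulation, so the following is only a minimal sketch of the general technique under stated assumptions: a per-channel scale and shift chosen by source language, blended with the unmodified features through a scalar sigmoid gate. The function name `film_gated`, the table `LANG_PARAMS`, and all numeric values are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def film_gated(features: List[Vector], gamma: Vector, beta: Vector,
               gate_logit: float) -> List[Vector]:
    """Apply FiLM (per-channel scale and shift) to every frame, then blend
    the result with the original frame using a scalar gate in (0, 1).

    Assumption: a residual blend out = x + g * (FiLM(x) - x); the paper's
    actual gating scheme may differ.
    """
    g = sigmoid(gate_logit)
    out = []
    for frame in features:
        modulated = [ga * x + be for x, ga, be in zip(frame, gamma, beta)]
        out.append([x + g * (m - x) for x, m in zip(frame, modulated)])
    return out


# Hypothetical per-language (gamma, beta) pairs. In a trained model these
# would typically be predicted from a learned language embedding.
LANG_PARAMS: Dict[str, Tuple[Vector, Vector]] = {
    "fr": ([1.5, 0.5], [0.1, -0.1]),
    "de": ([0.8, 1.2], [0.0, 0.2]),
}

frames = [[1.0, 2.0], [0.0, -1.0]]  # two frames, two channels
gamma, beta = LANG_PARAMS["fr"]

closed = film_gated(frames, gamma, beta, gate_logit=-50.0)  # gate near 0
open_ = film_gated(frames, gamma, beta, gate_logit=50.0)    # gate near 1
```

With the gate nearly closed the features pass through almost unchanged, and with it nearly open the output is the fully modulated features. A gate of this kind lets a model start close to the identity mapping and learn how much language conditioning to apply.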
Related papers
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025) · 0.00
- SimulS2S-LLM: Unlocking Simultaneous Inference Of Speech LLMs For Speech-to-speech Translation (2025) · 3.58
- Efficient And Adaptive Simultaneous Speech Translation With Fully Unidirectional Architecture (2025) · 2.26
- MSLM-S2ST: A Multitask Speech Language Model For Textless Speech-to-speech Translation With Speaker Style Preservation (2024) · 0.00
- Bridging The Gaps Of Both Modality And Language: Synchronous Bilingual CTC For Speech Translation And Speech Recognition (2023) · 4.49
- Preserving Speaker Information In Direct Speech-to-speech Translation With Non-autoregressive Generation And Pretraining (2024) · 0.00
- OmniFusion: Simultaneous Multilingual Multimodal Translations Via Modular Fusion (2025) · 0.00
- StreamSpeech: Simultaneous Speech-to-speech Translation With Multi-task Learning (2024) · 7.81