OmniFusion: Simultaneous Multilingual Multimodal Translations Via Modular Fusion
2025 · Sai Koneru, Matthias Huck, Jan Niehues
Abstract
Open-source text-only translation large language models (LLMs) have made significant progress in both language coverage and quality. However, these models can only be used for speech translation (ST) in cascaded pipelines, which run automatic speech recognition first and translation second. This adds latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, that can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a
Related papers
- Fused Acoustic And Text Encoding For Multimodal Bilingual Pretraining And Speech Translation (2021) 0.00
- S2ST-omni: Hierarchical Language-aware SpeechLLM Adaptation For Multilingual Speech-to-speech Translation (2025) 0.00
- Adapting Speech Foundation Models For Unified Multimodal Speech Recognition With Large Language Models (2025) 0.00
- Let's Fuse Step By Step: A Generative Fusion Decoding Algorithm With LLMs For Robust And Instruction-aware ASR And OCR (2024) 0.00
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024) 7.31
- Mixture-of-transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024) 0.00
- MCAT: Scaling Many-to-many Speech-to-text Translation With MLLMs To 70 Languages (2025) 2.41
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025) 5.84