TMT: Tri-modal Translation Between Speech, Image, And Text By Processing Different Modalities As Different Languages
2024 · Minsu Kim, Jee-Weon Jung, Hyeongseop Rha, et al.
Abstract
The capability to jointly process multi-modal information is becoming essential. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder its development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT outperforms single mod
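The "unified interface" idea in the abstract can be pictured as placing each modality's discrete tokens in a disjoint range of one shared vocabulary, so that a single encoder-decoder sees only integer IDs. The sketch below is illustrative and is not taken from the paper: the vocabulary sizes, the offset scheme, and the names `VOCAB_SIZES`, `to_shared`, and `from_shared` are all assumptions.

```python
# Hypothetical vocabulary sizes; the abstract does not state the real ones.
VOCAB_SIZES = {"text": 1000, "speech": 200, "image": 1024}


def build_offsets(vocab_sizes):
    """Assign each modality a contiguous, non-overlapping ID range."""
    offsets = {}
    start = 0
    for modality, size in vocab_sizes.items():
        offsets[modality] = start
        start += size
    return offsets, start


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_shared(modality, token_ids):
    """Map modality-local token IDs into the shared vocabulary."""
    size = VOCAB_SIZES[modality]
    if any(t < 0 or t >= size for t in token_ids):
        raise ValueError(f"token ID out of range for modality {modality!r}")
    return [OFFSETS[modality] + t for t in token_ids]


def from_shared(shared_ids):
    """Map shared IDs back to (modality, local IDs); all IDs must share one modality."""
    modalities = set()
    local_ids = []
    for s in shared_ids:
        for modality, start in OFFSETS.items():
            if start <= s < start + VOCAB_SIZES[modality]:
                modalities.add(modality)
                local_ids.append(s - start)
                break
        else:
            raise ValueError(f"shared ID {s} is outside the vocabulary")
    if len(modalities) != 1:
        raise ValueError("expected IDs from exactly one modality")
    return modalities.pop(), local_ids
```

Under these assumed sizes, `to_shared("speech", [0, 5])` yields `[1000, 1005]`, and `from_shared` recovers the modality and local IDs so that a modality-specific detokenizer could take over.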
Related papers
- Multimodal Machine Translation Through Visuals And Speech (2019)
- TMT: A Transformer-based Modal Translator For Improving Multimodal Sequence Representations In Audio Visual Scene-aware Dialog (2020)
- Mixture-of-Transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024)
- Meta-Transformer: A Unified Framework For Multimodal Learning (2023)
- Efficient Audiovisual Speech Processing Via MUTUD: Multimodal Training And Unimodal Deployment (2025)
- Unified Cross-modal Translation Of Score Images, Symbolic Music, And Performance Audio (2025)
- TEAL: Tokenize And Embed ALL For Multi-modal Large Language Models (2023)
- SpeechT5: Unified-modal Encoder-decoder Pre-training For Spoken Language Processing (2021)