Meta-Transformer: A Unified Framework For Multimodal Learning
2023 · Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, et al.
Abstract
Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12
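The three components named in the abstract (tokenizer, shared frozen encoder, task-specific heads) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the class names, the `EMBED_DIM` width, the chunk-and-project tokenizer, and the tanh-plus-mean-pooling stand-in for the encoder are all assumptions made purely for illustration. It only shows the data flow: per-modality tokenizers map raw inputs into one token space, a single encoder with fixed weights turns tokens into a feature, and a small head per task consumes that feature.

```python
import math
import random

EMBED_DIM = 8  # hypothetical width of the shared token space


def make_matrix(rows, cols, rng):
    """Random (rows x cols) weight matrix as nested lists."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class LinearTokenizer:
    """Hypothetical modality-specific tokenizer: cut a flat input into
    fixed-size chunks and project each chunk into the shared token space."""

    def __init__(self, chunk_size, rng):
        self.chunk_size = chunk_size
        self.proj = make_matrix(EMBED_DIM, chunk_size, rng)

    def __call__(self, values):
        last_start = len(values) - self.chunk_size
        chunks = [values[i:i + self.chunk_size]
                  for i in range(0, last_start + 1, self.chunk_size)]
        return [matvec(self.proj, chunk) for chunk in chunks]


class FrozenEncoder:
    """Stand-in for the modality-shared encoder (NOT a real transformer).
    Weights are stored as tuples so they cannot be updated in place."""

    def __init__(self, rng):
        self._weights = tuple(tuple(row)
                              for row in make_matrix(EMBED_DIM, EMBED_DIM, rng))

    def __call__(self, tokens):
        hidden = [[math.tanh(h) for h in matvec(self._weights, token)]
                  for token in tokens]
        # Mean-pool over tokens to get one feature vector per input.
        return [sum(column) / len(hidden) for column in zip(*hidden)]


class TaskHead:
    """Task-specific linear head; in this sketch, the trainable part."""

    def __init__(self, num_classes, rng):
        self.weights = make_matrix(num_classes, EMBED_DIM, rng)

    def __call__(self, feature):
        return matvec(self.weights, feature)


rng = random.Random(0)
encoder = FrozenEncoder(rng)  # one encoder shared by every modality
tokenizers = {"audio": LinearTokenizer(4, rng),
              "time_series": LinearTokenizer(2, rng)}
heads = {"audio": TaskHead(3, rng), "time_series": TaskHead(2, rng)}


def predict(modality, values):
    """Tokenize with the modality's tokenizer, encode with the shared
    frozen encoder, then apply that modality's task head."""
    tokens = tokenizers[modality](values)
    return heads[modality](encoder(tokens))


audio_logits = predict("audio", [0.1 * i for i in range(16)])
series_logits = predict("time_series", [1.0, 0.5, -0.5, -1.0])
```

Both calls pass through the same `encoder` object; only the tokenizer and head differ per modality, which is the separation the abstract describes.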
Related papers
- A Modular End-to-end Multimodal Learning Method For Structured And Unstructured Data (2024)
- Omni-c: Compressing Heterogeneous Modalities Into A Single Dense Encoder (2026)
- TMT: Tri-modal Translation Between Speech, Image, And Text By Processing Different Modalities As Different Languages (2024)
- Unified Cross-modal Translation Of Score Images, Symbolic Music, And Performance Audio (2025)
- Mixture-of-transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024)
- TEAL: Tokenize And Embed ALL For Multi-modal Large Language Models (2023)
- Multimodal Frame-scoring Transformer For Video Summarization (2022)
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024)