Abstract

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12
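
The three components named in the abstract (data tokenizer, modality-shared frozen encoder, task-specific head) can be illustrated with a toy data-flow sketch. This is not the authors' implementation. The width `D`, the functions `tokenize_text` and `tokenize_series`, and the classes `FrozenEncoder` and `LinearHead` are all hypothetical names chosen for illustration, and a fixed random projection with mean pooling stands in for the Transformer encoder. The sketch only shows that inputs from different modalities land in one token space, pass through the same never-updated encoder, and that training touches the head alone.

```python
import math
import random

D = 8  # shared token width; an illustrative value, not taken from the paper


def tokenize_text(text):
    """Toy text tokenizer: one deterministic D-dim vector per word."""
    out = []
    for word in text.split():
        rng = random.Random(word)  # string seed gives a repeatable vector
        out.append([rng.uniform(-1.0, 1.0) for _ in range(D)])
    return out


def tokenize_series(values, patch=4):
    """Toy time-series tokenizer: fixed-size patches, zero-padded to D.

    Assumes patch <= D.
    """
    out = []
    for i in range(0, len(values), patch):
        chunk = [float(v) for v in values[i:i + patch]]
        out.append(chunk + [0.0] * (D - len(chunk)))
    return out


class FrozenEncoder:
    """Stand-in for the modality-shared encoder; weights are never updated."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Tuples make the parameters immutable, mirroring "frozen".
        self.weights = tuple(
            tuple(rng.gauss(0.0, 0.5) for _ in range(D)) for _ in range(D)
        )

    def __call__(self, tokens):
        hidden = [
            [math.tanh(sum(w * x for w, x in zip(row, tok))) for row in self.weights]
            for tok in tokens
        ]
        # Mean-pool over the token axis to get one feature vector.
        return [sum(col) / len(hidden) for col in zip(*hidden)]


class LinearHead:
    """Trainable task-specific head producing a single score."""

    def __init__(self):
        self.w = [0.0] * D

    def __call__(self, feat):
        return sum(w * f for w, f in zip(self.w, feat))

    def sgd_step(self, feat, target, lr=0.1):
        # Gradient of 0.5 * (pred - target)^2 with respect to w.
        err = self(feat) - target
        self.w = [w - lr * err * f for w, f in zip(self.w, feat)]


encoder = FrozenEncoder()
text_feat = encoder(tokenize_text("a frozen shared encoder"))
series_feat = encoder(tokenize_series([0.1, 0.4, 0.2, 0.9, 0.5, 0.3]))

head = LinearHead()
before = encoder.weights
head.sgd_step(text_feat, target=1.0)  # only the head's weights change
```

Both `text_feat` and `series_feat` are vectors of length `D` produced by the same encoder object, so one head design could consume features from either modality. A real system would replace the random projection with a pretrained Transformer and the toy tokenizers with learned, modality-specific embeddings.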

Authors

(none)

Tags

  • Multimodal Audio

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 1653
  • HF likes: 0
  • heat score: 6.44
  • arxiv key: zhang2023meta

Related papers