Cross-modal Discrete Representation Learning
2021 · Alexander H. Liu, Souyoung Jin, Cheng-I Jeff Lai, et al.
Abstract
Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work, we present a self-supervised learning framework that can learn a representation that captures finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, such that cross-modal object/action localization can be performed without direct supervision. In our experiments, we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame
Related papers
- Semantic Residual For Multimodal Unified Discrete Representation (2024) 4.52
- Joint Representation Learning And Novel Category Discovery On Single- And Multi-modal Data (2021) 13.11
- Multimodal Representation Learning Conditioned On Semantic Relations (2025) 0.00
- Discriminative Cross-view Binary Representation Learning (2018) 4.52
- Learning Deep Representation Of Multityped Objects And Tasks (2016) 0.00
- Generalized Multi-view Embedding For Visual Recognition And Cross-modal Retrieval (2016) 14.69
- Unified Representation Learning For Cross Model Compatibility (2020) 5.24
- Learning Shared Representations From Unpaired Data (2025) 0.00