MCEN: Bridging Cross-modal Gap Between Cooking Recipes And Dish Images With Latent Variable Model
2020 · Han Fu, Rui Wu, Chenghao Liu, et al.
Abstract
Driven by growing concern about diet and health, food computing has attracted considerable attention from both industry and the research community. One of the most popular research topics in this domain is food retrieval, owing to its influence on health-oriented applications. In this paper, we focus on cross-modal retrieval between food images and cooking recipes. We present the Modality-Consistent Embedding Network (MCEN), which learns modality-invariant representations by projecting images and texts into the same embedding space. To capture the latent alignments between modalities, we incorporate stochastic latent variables that explicitly exploit the interactions between textual and visual features. Importantly, our method learns the cross-modal alignments during training but computes the embeddings of each modality independently at inference time for efficiency. Extensive experimental results clearly demonstrate that the proposed MCEN outperforms all
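The inference-time behaviour described in the abstract, where each modality is embedded independently and retrieval happens in a shared space, can be illustrated with a minimal sketch. It assumes the image and recipe embeddings have already been computed by separate encoders, and it ranks recipes by cosine similarity. The similarity measure, the function names, and the toy vectors are all assumptions made for illustration; the abstract does not specify them, and the latent-variable training procedure is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_recipes(image_embedding, recipe_embeddings):
    """Return recipe indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_embedding, r) for r in recipe_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical, hand-written embeddings standing in for encoder outputs.
# In a real system the recipe embeddings could be computed once and cached,
# because they do not depend on the query image.
recipe_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
]
image_embedding = [0.1, 0.9, 0.3]

ranking = rank_recipes(image_embedding, recipe_embeddings)
print(ranking)
```

Because neither embedding depends on the other at query time, one side of the collection can be indexed in advance, which is the efficiency benefit the abstract refers to.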
Related papers
- Cross-modal Retrieval In The Cooking Context: Learning Semantic Text-image Embeddings (2018) 0.00
- Cross-modal Food Retrieval: Learning A Joint Embedding Of Food Images And Recipes With Semantic Consistency And Attention Mechanism (2020) 12.10
- CHEF: Cross-modal Hierarchical Embeddings For Food Domain Retrieval (2021) 8.35
- SIMMER: Cross-modal Food Image-recipe Retrieval Via MLLM-based Embedding (2026) 0.00
- Recipe1M+: A Dataset For Learning Cross-modal Embeddings For Cooking Recipes And Food Images (2018) 17.24
- Mitigating Cross-modal Representation Bias For Multicultural Image-to-recipe Retrieval (2025) 0.00
- Cross-modal Retrieval And Synthesis (X-MRS): Closing The Modality Gap In Shared Representation Learning (2020) 0.00
- Transformer Decoders With Multimodal Regularization For Cross-modal Food Retrieval (2022) 14.17