SIMMER: Cross-modal Food Image-recipe Retrieval Via MLLM-based Embedding
2026 · Keisuke Gomi, Keiji Yanai
Abstract
Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the mo
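As a rough illustration of the prompt-template idea described in the abstract, the sketch below flattens a recipe's title, ingredients, and instructions into one text prompt and ranks candidates by cosine similarity. The field labels, the `Recipe` class, and the function names are hypothetical and not taken from the paper. The toy vectors stand in for embeddings that a single unified encoder such as VLM2Vec would produce; no model is called here.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Recipe:
    """Hypothetical container for the three recipe components."""
    title: str
    ingredients: list[str]
    instructions: list[str]


def format_recipe_prompt(recipe: Recipe) -> str:
    """Flatten a structured recipe into one prompt string.

    The "Title:/Ingredients:/Instructions:" labels are an assumption
    made for illustration, not the paper's actual template.
    """
    steps = " ".join(
        f"{i}. {step}" for i, step in enumerate(recipe.instructions, start=1)
    )
    return "\n".join([
        f"Title: {recipe.title}",
        "Ingredients: " + ", ".join(recipe.ingredients),
        "Instructions: " + steps,
    ])


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query: list[float], candidates: list[list[float]]) -> list[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    return sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )


if __name__ == "__main__":
    recipe = Recipe(
        title="Miso Soup",
        ingredients=["dashi", "miso", "tofu"],
        instructions=["Heat dashi.", "Dissolve miso."],
    )
    print(format_recipe_prompt(recipe))
    # Toy 2-D vectors standing in for image and recipe embeddings.
    print(rank_by_cosine([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]))
```

In an actual pipeline the prompt string and the food image would each be passed through the same encoder, and the resulting embeddings would replace the toy vectors in the ranking step.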
Related papers
- Cross-modal Retrieval In The Cooking Context: Learning Semantic Text-image Embeddings (2018) · 0.00
- Recipe1M+: A Dataset For Learning Cross-modal Embeddings For Cooking Recipes And Food Images (2018) · 17.24
- Transformer Decoders With Multimodal Regularization For Cross-modal Food Retrieval (2022) · 14.17
- CHEF: Cross-modal Hierarchical Embeddings For Food Domain Retrieval (2021) · 8.35
- Cross-modal Food Retrieval: Learning A Joint Embedding Of Food Images And Recipes With Semantic Consistency And Attention Mechanism (2020) · 12.10
- MCEN: Bridging Cross-modal Gap Between Cooking Recipes And Dish Images With Latent Variable Model (2020) · 13.39
- Learning TFIDF Enhanced Joint Embedding For Recipe-image Cross-modal Retrieval Service (2021) · 10.85
- Cross-modal Retrieval And Synthesis (X-MRS): Closing The Modality Gap In Shared Representation Learning (2020) · 0.00