Enhancing Recipe Retrieval With Foundation Models: A Data Augmentation Perspective
2023 · Fangzhou Song, Bin Zhu, Yanbin Hao, et al.
Abstract
Learning recipe and food image representations in a common embedding space is non-trivial but crucial for cross-modal recipe retrieval. In this paper, we propose a new perspective on this problem by utilizing foundation models for data augmentation. Leveraging the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipes and food images by extracting alignable information related to the counterpart. Specifically, Llama2 is employed to generate a textual description from the recipe, aiming to capture the visual cues of a food image, and SAM is used to produce image segments that correspond to key ingredients in the recipe. To make full use of the augmented data, we introduce the Data Augmented Retrieval framework (DAR) to enhance recipe and image representation learning for cross-modal retrieval. We first inject adapter layers into the pre-trained CLIP model to reduce computation cost rather than fully fine-tuning all the parameters. In addition, multi-leve
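The two mechanisms named in the abstract, adapter layers added to a frozen encoder and retrieval in a shared embedding space, can be illustrated with a minimal sketch. The abstract does not specify the adapter design or the similarity measure, so the residual bottleneck form, the ReLU, the cosine similarity, and every name below (`adapter`, `cosine`, `rank_recipes`) are assumptions for illustration only, not the DAR implementation.

```python
import math


def adapter(x, w_down, w_up):
    """Assumed residual bottleneck adapter: x + W_up * relu(W_down * x).

    w_down has one row per bottleneck unit (each row as long as x);
    w_up has one row per output dimension (each row as long as the bottleneck).
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_down]
    delta = [sum(w * h for w, h in zip(row, hidden)) for row in w_up]
    return [xi + d for xi, d in zip(x, delta)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_recipes(image_emb, recipe_embs):
    """Return recipe indices sorted from most to least similar to the image."""
    scores = [cosine(image_emb, r) for r in recipe_embs]
    return sorted(range(len(recipe_embs)), key=lambda i: scores[i], reverse=True)


# Toy 2-d embeddings: recipe 1 points almost the same way as the image query.
image = [1.0, 0.0]
recipes = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_recipes(image, recipes)
```

In this sketch only `w_down` and `w_up` would be trained while the backbone weights stay fixed, which is the usual reason adapters cost less than full fine-tuning. With `w_up` set to zeros the adapter returns its input unchanged, a common way to start training from the pre-trained model's behaviour.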
Related papers
- Self-enhancement Improves Text-image Retrieval In Foundation Visual-language Models (2023)
- MALM: Mask Augmentation Based Local Matching For Food-recipe Retrieval (2023)
- SIMMER: Cross-modal Food Image-recipe Retrieval Via MLLM-based Embedding (2026)
- Cross-modal Retrieval In The Cooking Context: Learning Semantic Text-image Embeddings (2018)
- Cross-modal Food Retrieval: Learning A Joint Embedding Of Food Images And Recipes With Semantic Consistency And Attention Mechanism (2020)
- Transformer Decoders With Multimodal Regularization For Cross-modal Food Retrieval (2022)
- Recipe1M+: A Dataset For Learning Cross-modal Embeddings For Cooking Recipes And Food Images (2018)
- Images & Recipes: Retrieval In The Cooking Context (2018)