Mitigating Cross-modal Representation Bias For Multicultural Image-to-recipe Retrieval
2025 · Qing Wang, Chong-Wah Ngo, Yu Cao, et al.
Abstract
Existing approaches to image-to-recipe retrieval implicitly assume that a food image fully captures the details documented in its recipe. However, a food image reflects only the visual outcome of a cooked dish, not the underlying cooking process. Consequently, cross-modal representations learned to bridge the modality gap between images and recipes tend to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased toward the dominant visual elements, which makes it difficult to rank similar recipes that differ subtly in their ingredients and cooking methods. This bias is expected to be more severe when the training data mixes images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these
Related papers
- Towards Unbiased Cross-modal Representation Learning For Food Image-to-recipe Retrieval (2025)
- Cross-modal Retrieval In The Cooking Context: Learning Semantic Text-image Embeddings (2018)
- Cross-lingual Adaptation For Recipe Retrieval With Mixup (2022)
- Cross-modal Retrieval And Synthesis (X-MRS): Closing The Modality Gap In Shared Representation Learning (2020)
- SIMMER: Cross-modal Food Image-recipe Retrieval Via MLLM-based Embedding (2026)
- MCEN: Bridging Cross-modal Gap Between Cooking Recipes And Dish Images With Latent Variable Model (2020)
- Cross-modal Food Retrieval: Learning A Joint Embedding Of Food Images And Recipes With Semantic Consistency And Attention Mechanism (2020)
- CHEF: Cross-modal Hierarchical Embeddings For Food Domain Retrieval (2021)