ASR-enhanced Multimodal Representation Learning For Cross-domain Product Retrieval
2024 · Ruixiang Zhao, Jian Jia, Yan Li, et al.
Abstract
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live-stream promotions. A unified, vectorized cross-domain product representation is therefore essential. Because the broad-domain scenario combines large intra-product variance with high inter-product similarity, a visual-only representation is inadequate. Automatic Speech Recognition (ASR) text derived from short or live-stream videos is readily accessible, but how to denoise this highly noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify
Related papers
- AFMRL: Attribute-enhanced Fine-grained Multi-modal Representation Learning In E-commerce (2026)
- Multimodal Semantic Retrieval For Product Search (2025)
- Product1M: Towards Weakly Supervised Instance-level Product Retrieval Via Cross-modal Pretraining (2021)
- Entity-graph Enhanced Cross-modal Pretraining For Instance-level Product Retrieval (2022)
- ACE-BERT: Adversarial Cross-modal Enhanced BERT For E-commerce Retrieval (2021)
- Optimizing Product Deduplication In E-commerce With Multimodal Embeddings (2025)
- MRSE: An Efficient Multi-modality Retrieval System For Large Scale E-commerce (2024)
- Semantic-enhanced Modality-asymmetric Retrieval For Online E-commerce Search (2025)