LexMAE: Lexicon-bottlenecked Pretraining For Large-scale Retrieval
2022 · Tao Shen, Xiubo Geng, Chongyang Tao, et al.
Abstract
In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval: the former prefers certain or low-entropy words, whereas the latter favors pivot or high-entropy words. This gap is the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge it, we propose a new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Specifically, we place a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE
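The continuous bag-of-words bottleneck described in the abstract can be illustrated with a small NumPy sketch. The abstract does not spell out how the bottleneck is computed, so the details here are assumptions: the encoder's per-token vocabulary logits are max-pooled over sequence positions, a softmax turns the pooled scores into a lexicon-importance distribution, and the bottleneck vector is the importance-weighted sum of word embeddings. The function names, shapes, and random inputs are hypothetical and only serve the illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def cbow_bottleneck(token_logits, word_embeddings):
    """Sketch of a continuous bag-of-words bottleneck (assumed formulation).

    token_logits:    (seq_len, vocab_size) encoder language-modeling logits
    word_embeddings: (vocab_size, hidden_size) word embedding matrix
    """
    # Assumption: pool over positions so each vocabulary entry gets one score.
    pooled = token_logits.max(axis=0)            # (vocab_size,)
    # Normalize the scores into a lexicon-importance distribution.
    importance = softmax(pooled)                 # (vocab_size,), sums to 1
    # Continuous bag of words: importance-weighted sum of word embeddings.
    bottleneck = importance @ word_embeddings    # (hidden_size,)
    return importance, bottleneck


# Toy inputs: 6 tokens, a vocabulary of 10 words, hidden size 4.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
embeddings = rng.normal(size=(10, 4))
importance, bottleneck = cbow_bottleneck(logits, embeddings)
```

In this reading, the weakened decoder would receive only `bottleneck` as its summary of the passage, so reconstruction pressure pushes `importance` toward the words that matter most for the passage's content.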
Related papers
- LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023) · 10.85
- Less Is More: Pre-train A Strong Text Encoder For Dense Retrieval Using A Weak Decoder (2021) · 14.29
- Challenging Decoder Helps In Masked Auto-encoder Pre-training For Dense Passage Retrieval (2023) · 0.00
- Drop Your Decoder: Pre-training With Bag-of-word Prediction For Dense Passage Retrieval (2024) · 3.58
- Scaling Sparse And Dense Retrieval In Decoder-only LLMs (2025) · 6.34
- LexSemBridge: Fine-grained Dense Representation Enhancement Through Token-aware Embedding Augmentation (2025) · 2.35
- VLMAE: Vision-language Masked Autoencoder (2022) · 0.00
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025) · 0.00