Text Embeddings By Weakly-supervised Contrastive Pre-training
2022 · Liang Wang, Nan Yang, Xiaolong Huang, et al.
Abstract
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any task requiring a single-vector representation of text, such as retrieval, clustering, and classification, and achieves strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. In the zero-shot setting, E5 is the first model to outperform the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
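To make "trained in a contrastive manner" on text pairs concrete, here is a minimal sketch of one common contrastive objective: InfoNCE with in-batch negatives. The abstract does not specify the loss, so the formulation, the cosine-similarity scoring, the `temperature` value, and the function names are all assumptions for illustration, not the paper's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(query_embs, passage_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, passage) pairs.

    Passage i is treated as the positive for query i; every other passage
    in the batch acts as a negative. The temperature is a hypothetical
    default, not a value taken from the paper.
    """
    q = [l2_normalize(v) for v in query_embs]
    p = [l2_normalize(v) for v in passage_embs]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every passage.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching passage.
        total += log_denom - logits[i]
    return total / len(q)


# Toy 2-D "embeddings": aligned pairs should score a lower loss than swapped ones.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce_loss(queries, aligned) < info_nce_loss(queries, swapped))
```

In practice the embeddings would come from a trained encoder and the loss would be computed with an automatic-differentiation framework; the pure-Python version above only shows the arithmetic.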
Related papers
- Text And Code Embeddings By Contrastive Pre-training (2022)
- Improving Embedding With Contrastive Fine-tuning On Small Datasets With Expert-augmented Scores (2024)
- Efficient Fine-tuning Methodology Of Text Embedding Models For Information Retrieval: Contrastive Learning Penalty (CLP) (2024)
- VideoCLIP: Contrastive Pre-training For Zero-shot Video-text Understanding (2021)
- Unsupervised Context Aware Sentence Representation Pretraining For Multi-lingual Dense Retrieval (2022)
- Less Is More: Pre-train A Strong Text Encoder For Dense Retrieval Using A Weak Decoder (2021)
- Refining Joint Text And Source Code Embeddings For Retrieval Task With Parameter-efficient Fine-tuning (2024)
- Pre-train A Discriminative Text Encoder For Dense Retrieval Via Contrastive Span Prediction (2022)