GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
2024 · Aivin V. Solatorio
Abstract
Embedding models are integral to AI applications like semantic search, personalized recommendations, and retrieval augmented generation for LLMs, necessitating high-quality training data. However, the limited scalability of manual data curation prompts the need for automated methods to ensure data integrity. Traditional unsupervised triplet mining automates training data generation, crucial for embedding model training, yet inadvertently injects biases and noise, thereby degrading model performance. Addressing this, we introduce GISTEmbed, a novel strategy that enhances in-batch negative selection during contrastive training through a guide model. This approach departs from reliance on random sampling and equal utility assumption of batch negatives, significantly reducing noise from data quality issues and improving model fine-tuning. Benchmarked against the Massive Text Embedding Benchmark (MTEB), GISTEmbed showcases consistent performance improvements across various model sizes and a
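The abstract does not spell out how the guide model selects negatives, so the following is only a minimal sketch under a stated assumption: an in-batch candidate is dropped as a negative for a query whenever the guide model scores it as more similar to that query than the labelled positive. The function name `guided_in_batch_loss`, its parameters, and the temperature value are hypothetical and chosen for illustration; this is not presented as the paper's implementation.

```python
import numpy as np

def guided_in_batch_loss(q, p, gq, gp, temperature=0.05):
    """Contrastive loss over in-batch negatives, filtered by a guide model.

    q, p   : (B, d)  L2-normalised query / positive embeddings from the model being trained
    gq, gp : (B, d') L2-normalised embeddings of the same texts from the guide model
    Returns the mean loss and the boolean mask of discarded candidates.
    """
    sim = q @ p.T        # (B, B) similarities from the trained model
    guide = gq @ gp.T    # (B, B) similarities from the guide model

    # Assumed selection rule: discard candidate j for query i when the guide
    # rates it strictly above the labelled positive (the diagonal entry).
    threshold = np.diag(guide)[:, None]
    mask = guide > threshold  # strict '>' means the diagonal is never masked

    # Masked candidates get -inf so they contribute nothing to the softmax.
    logits = np.where(mask, -np.inf, sim / temperature)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the labelled positive as the target of each row.
    return -np.mean(np.diag(log_prob)), mask
```

With random sampling every off-diagonal entry would be treated as an equally useful negative; in this sketch the mask removes the ones the guide considers likely false negatives before the loss is computed.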
Related papers
- NV-Retriever: Improving Text Embedding Models With Effective Hard-negative Mining (2024)
- Improving Embedding With Contrastive Fine-tuning On Small Datasets With Expert-augmented Scores (2024)
- Negative Sample Is Negative In Its Own Way: Tailoring Negative Sentences For Image-text Retrieval (2021)
- Efficient Fine-tuning Methodology Of Text Embedding Models For Information Retrieval: Contrastive Learning Penalty (CLP) (2024)
- Improved Embeddings With Easy Positive Triplet Mining (2019)
- Multitask Text-to-visual Embedding With Titles And Clickthrough Data (2019)
- VSE++: Improving Visual-semantic Embeddings With Hard Negatives (2017)
- Your Negative May Not Be True Negative: Boosting Image-text Matching With False Negative Elimination (2023)