Evaluating Embedding Models And Pipeline Optimization For AI Search Quality
2025 · Philip Zhong, Kent Chen, Don Wang
Abstract
We evaluate text embedding models and pipeline configurations for AI-driven search systems. We compare sentence-transformer and generative embedding models (e.g., All-MPNet, BGE, GTE, and Qwen) across embedding dimensions, indexing methods (Milvus HNSW/IVF), and chunking strategies. A custom evaluation dataset of 11,975 query-chunk pairs was synthesized from US City Council meeting transcripts using a local large language model (LLM). The data pipeline comprises preprocessing, automated question generation per chunk, manual validation, and integration with continuous integration/continuous deployment (CI/CD). We measure retrieval accuracy with reference-based metrics: Top-K Accuracy and Normalized Discounted Cumulative Gain (NDCG). Our results show that higher-dimensional embeddings significantly boost search quality (e.g., Qwen3-Embedding-8B/4096 achieves a Top-3 accuracy of about 0.571 versus 0.412 for GTE-large/1024), and that neural re-rankers (e.g., a BGE cross
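The two reference-based metrics named in the abstract can be illustrated with a minimal sketch. It assumes each query has exactly one relevant chunk (the chunk the question was generated from) and binary relevance; the function names, the chunk IDs, and the toy rankings are hypothetical and are not taken from the paper.

```python
import math


def top_k_accuracy(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold chunk appears among the top-k results."""
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


def ndcg_at_k(ranked, gold, k):
    """NDCG@k for one query, assuming a single relevant chunk.

    With one relevant item and binary relevance, the ideal DCG is 1,
    so NDCG reduces to 1 / log2(rank + 1) when the gold chunk is found.
    """
    for rank, chunk_id in enumerate(ranked[:k], start=1):
        if chunk_id == gold:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def mean_ndcg_at_k(ranked_ids, gold_ids, k):
    """Average NDCG@k over all queries."""
    scores = [ndcg_at_k(r, g, k) for r, g in zip(ranked_ids, gold_ids)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): three queries, each with a ranked list of chunk IDs.
ranked_ids = [["c3", "c1", "c7"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
gold_ids = ["c1", "c2", "c0"]

top1 = top_k_accuracy(ranked_ids, gold_ids, k=1)
top3 = top_k_accuracy(ranked_ids, gold_ids, k=3)
ndcg3 = mean_ndcg_at_k(ranked_ids, gold_ids, k=3)
print(top1, top3, ndcg3)
```

In this toy data the gold chunk is ranked second for the first query, first for the second, and is missing for the third, so Top-3 accuracy counts two hits out of three while NDCG additionally penalizes the lower rank in the first query.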