4bit-quantization In Vector-embedding For RAG
2025 · Taehee Jeong
Abstract
Retrieval-augmented generation (RAG) is a promising technique for addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: their knowledge can be outdated because it is fixed by their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucination. RAG aims to mitigate these issues by drawing on a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. However, high-dimensional embeddings require a significant amount of memory to store, which becomes a major issue for large document databases. To alleviate this problem, we propose storing the embedding vectors with 4-bit quantization. This reduces the precision of each vector component from a 32-bit floating-point number to a 4-bit integer, cutting the raw storage per component by a factor of eight. Our approach has s
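To make the storage arithmetic concrete, here is a minimal sketch of 4-bit quantization for a single embedding vector. The abstract does not say which quantization scheme the paper uses, so the per-vector min-max (uniform) scaling below is an assumption, and the function names `quantize_4bit` and `dequantize_4bit` are hypothetical. It uses only the Python standard library.

```python
import random

def quantize_4bit(vec):
    """Quantize a list of floats to 4-bit codes (0..15), two codes per byte.

    Assumption: uniform min-max scaling per vector; the paper's actual
    scheme is not given in the abstract.
    """
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant vector
    codes = [min(15, max(0, round((x - lo) / scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up evenly
    # High nibble holds the first code of each pair, low nibble the second.
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale, len(vec)

def dequantize_4bit(packed, lo, scale, n):
    """Unpack the 4-bit codes and map them back to approximate floats."""
    codes = []
    for b in packed:
        codes.append(b >> 4)
        codes.append(b & 0x0F)
    return [lo + c * scale for c in codes[:n]]

if __name__ == "__main__":
    random.seed(0)
    dim = 768  # illustrative embedding size, not taken from the paper
    vec = [random.uniform(-1.0, 1.0) for _ in range(dim)]
    packed, lo, scale, n = quantize_4bit(vec)
    restored = dequantize_4bit(packed, lo, scale, n)
    print("float32 bytes:", dim * 4)     # 4 bytes per component
    print("4-bit bytes:  ", len(packed))  # half a byte per component
    print("max abs error:", max(abs(a - b) for a, b in zip(vec, restored)))
```

Each vector also carries its own `lo` and `scale`, so the real saving is slightly below eightfold. With this rounding scheme the reconstruction error per component is bounded by half a quantization step (`scale / 2`).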
Related papers
- Optimization Of Embeddings Storage For RAG Systems Using Quantization And Dimensionality Reduction Techniques (2025) · 0.00
- SRAG: RAG With Structured Data Improves Vector Retrieval (2026) · 0.00
- Ragdb: A Zero-dependency, Embeddable Architecture For Multimodal Retrieval-augmented Generation On The Edge (2025) · 0.00
- Self-aware Vector Embeddings For Retrieval-augmented Generation: A Neuroscience-inspired Framework For Temporal, Confidence-weighted, And Relational Knowledge (2026) · 0.00
- Xrag: Extreme Context Compression For Retrieval-augmented Generation With One Token (2024) · 7.81
- M4-RAG: A Massive-scale Multilingual Multi-cultural Multimodal RAG (2025) · 2.00
- Hetarag: Hybrid Deep Retrieval-augmented Generation Across Heterogeneous Data Stores (2025) · 3.27
- Edgerag: Online-indexed RAG For Edge Devices (2024) · 0.00