Beyond Chunk-then-embed: A Comprehensive Taxonomy And Evaluation Of Document Chunking Strategies For Information Retrieval
2026 · Yongjie Zhou, Shuai Wang, Bevan Koopman, et al.
Abstract
Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods (e.g., DenseX and LumberChunker) and contextualized strategies (e.g., Late Chunking), which generate embeddings before segmentation to preserve contextual information. However, these methods emerged independently and were evaluated on benchmarks with minimal overlap, making direct comparisons difficult. This paper reproduces prior studies in document chunking and presents a systematic framework that unifies existing strategies along two key dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically-informed and LLM-guided methods; and (2) embedding paradigms, which determine the timing of chunking relative to embedding (pre-embedding chunking vs. contextualized chu
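To make the two dimensions concrete, here is a minimal sketch in Python. It is not the paper's implementation. The three structure-based splitters are deliberately naive. `toy_encode` is a hypothetical stand-in for a contextual encoder, invented purely for illustration: each token vector carries the token's own length plus the mean token length of whatever input the encoder sees, so the result depends on context. All function names are assumptions.

```python
import re


def fixed_size_chunks(tokens, size):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def sentence_chunks(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def paragraph_chunks(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def toy_encode(tokens):
    """Hypothetical encoder (assumption, not a real model).

    Returns one 2-d vector per token: (own length, mean length of the input).
    The second component changes with the surrounding input, which mimics
    the context dependence of a real contextual encoder.
    """
    mean_len = sum(len(t) for t in tokens) / len(tokens)
    return [(float(len(t)), mean_len) for t in tokens]


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))


def pre_embedding_chunking(tokens, size):
    """Chunk first, then encode each chunk in isolation."""
    return [mean_pool(toy_encode(c)) for c in fixed_size_chunks(tokens, size)]


def contextualized_chunking(tokens, size):
    """Encode the whole document once, then pool token vectors per chunk span."""
    vectors = toy_encode(tokens)
    return [mean_pool(vectors[i:i + size]) for i in range(0, len(vectors), size)]


tokens = "a bb ccc dddd".split()
pre = pre_embedding_chunking(tokens, 2)
late = contextualized_chunking(tokens, 2)
```

With this toy encoder, the first component of each chunk vector is the same under both paradigms, while the second differs: in the pre-embedding case it reflects only the chunk, and in the contextualized case it reflects the whole document. That difference is the only point of the sketch; a real system would use a tokenizer and a long-context embedding model in place of `toy_encode`.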
Related papers
- Late Chunking: Contextual Chunk Embeddings Using Long-context Embedding Models (2024)
- Chunk Twice, Embed Once: A Systematic Study Of Segmentation And Representation Trade-offs In Chemistry-aware Retrieval-augmented Generation (2025)
- Visual Late Chunking: An Empirical Study Of Contextual Chunking For Efficient Visual Document Retrieval (2026)
- Taxonomy Of The Retrieval System Framework: Pitfalls And Paradigms (2026)
- Utilizing Metadata For Better Retrieval-augmented Generation (2026)
- Learning Refined Document Representations For Dense Retrieval Via Deliberate Thinking (2025)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- Graph-aware Late Chunking For Retrieval-augmented Generation In Biomedical Literature (2026)