Why Chunking Strategy Matters

Simple chunking works for flat text, but breaks structured documents.

Structured Document

Document
├─ Section 1
│  ├─ 1.1 Definitions
│  └─ 1.2 Scope
├─ Section 2
│  ├─ 2.1 Requirements
│  └─ 2.2 Exceptions
└─ Section 3
   └─ 3.1 Procedures

Fixed-Token Output Chunks

chunk_001: "...1.1 Definitions. Key terms used: 'Service' means the platform. 'User' means any person who 1.2 Scope. This agreement applies to..."
chunk_002: "...all users regardless of location. 2.1 Requirements. Users must: (a) be 18 years or older (b) provide accurate info..."
chunk_003: "...2.2 Exceptions. The following are exempt from Section 2.1: employees of Partner companies, users in..."
Arbitrary boundaries split sections mid-thought.
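
To make the failure mode concrete, here is a minimal sketch of a fixed-token splitter like the one illustrated above. The function name, the tiktoken tokenizer, and the 500-token window with overlap are illustrative assumptions, not the exact code used in the benchmark below.

```python
# Illustrative only: a naive fixed-token splitter that ignores document structure.
import tiktoken  # assumed tokenizer library; any tokenizer behaves the same way here

def chunk_fixed_tokens(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Slide a fixed-size token window over the text, ignoring headings entirely."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
    return chunks

# Because boundaries are placed purely by token count, a heading such as "1.2 Scope"
# can end one chunk while its body text opens the next: the mid-thought splits shown above.
```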

Industry-Leading Retrieval Performance

Evaluated on three public structured-document benchmarks spanning legal, academic, and technical standards: 697 documents and 22,000+ gold-standard questions.

Average across all benchmarks (macro-averaged)

| Method | Recall@1 | Recall@5 | MRR@5 | nDCG@5 | Chunks/Doc | Tokens/Chunk | Efficiency |
|---|---|---|---|---|---|---|---|
| Fixed Token (500) | 0.34 | 0.67 | 0.46 | 0.48 | 32 | 492 | 1.37 |
| LangChain RecursiveCharacterTextSplitter | 0.31 | 0.66 | 0.44 | 0.46 | 38 | 398 | 1.67 |
| Docling HierarchicalChunker | 0.28 | 0.59 | 0.39 | 0.41 | 171 | 376 | 1.56 |
| Flat Header Splitter | 0.47 | 0.84 | 0.61 | 0.66 | 11 | 1083 | 0.78 |
| DocSlicer | 0.58 | 0.86 | 0.69 | 0.70 | 38 | 374 | 2.30 |
Evaluation methodology

All methods use the same embedding model (text-embedding-3-small), similarity metric (cosine), and retrieval depth (top-k). Only the chunking strategy differs.

View full benchmark code on GitHub

Benchmark Datasets (697 documents, 22,000+ questions):

CUAD (Contract Understanding Atticus Dataset) — 510 legal contracts, 20,910 questions

Expert-annotated contracts with clause-level QA pairs covering 41 legal clause types

ACL Anthology — 33 academic papers, 264 questions

NLP research papers with information-seeking questions requiring evidence from full text

IETF RFC Corpus — 154 technical RFCs, 1,070 questions

Internet standards documents (HTTP/2, HTTP/3, TLS 1.3, QUIC, etc.) with generated QA pairs

Pipeline:

Vector search with text-embedding-3-small, cosine similarity, top-5 retrieval. Relevance determined by answer span overlap with chunk content.

Variable:

Chunking strategy only; all other components (embedding model, similarity metric, retrieval depth) held constant.
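
For readers who want to reproduce the setup, the sketch below shows the general shape of this pipeline. The helper names, the exact-substring overlap test, and the use of the OpenAI Python SDK with NumPy are assumptions for illustration; the actual benchmark code is linked above.

```python
# Sketch of the fixed evaluation pipeline: embed chunks once, embed each question,
# rank chunks by cosine similarity, and count a hit when the gold answer span
# overlaps the retrieved chunk. Helper names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors for cosine

def retrieve_top_k(question: str, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    q = embed([question])[0]
    scores = chunk_vecs @ q  # cosine similarity on unit-normalized vectors
    return np.argsort(-scores)[:k].tolist()

def contains_answer(chunk_text: str, answer_span: str) -> bool:
    # Simplest possible overlap test: the answer span appears verbatim in the chunk.
    # The benchmark's actual span-overlap criterion may be more forgiving.
    return answer_span.lower() in chunk_text.lower()

# Per document: chunk with the strategy under test, embed the chunks, then for each
# question check whether any of the top-5 retrieved chunks contains the answer span.
```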

Recall@k

Percentage of questions where the answer was found in the top k results

MRR (Mean Reciprocal Rank)

How quickly the answer appears. If the answer is at rank 1, the score is 1.0; at rank 2, 0.5; at rank 3, 0.33; and so on.

nDCG (Normalized DCG)

Overall ranking quality on a 0-1 scale. Higher means relevant results appear earlier and more consistently.

Context Efficiency

Quality per token cost. Measures recall achieved relative to total context sent to the LLM.
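
For reference, the ranking metrics above can be computed from a question's ranked retrieval results as sketched below (illustrative code assuming binary relevance per chunk, i.e. a chunk either contains the answer span or it does not).

```python
import math

def recall_at_k(ranked_relevance: list[int], k: int = 5) -> float:
    # ranked_relevance[i] is 1 if the chunk at rank i+1 contains the answer, else 0.
    return 1.0 if any(ranked_relevance[:k]) else 0.0

def mrr_at_k(ranked_relevance: list[int], k: int = 5) -> float:
    # Reciprocal rank of the first relevant chunk: rank 1 -> 1.0, rank 2 -> 0.5, ...
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_relevance: list[int], k: int = 5) -> float:
    # Discounted cumulative gain over the top k, normalized by the ideal ordering.
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ranked_relevance[:k], start=1))
    ideal = sorted(ranked_relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: answer found at rank 2 out of 5 retrieved chunks.
ranks = [0, 1, 0, 0, 0]
print(recall_at_k(ranks), mrr_at_k(ranks), round(ndcg_at_k(ranks), 3))  # 1.0 0.5 0.631
```

Per-question scores are averaged within each benchmark and then macro-averaged across the three benchmarks to produce the table above.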

Better RAG Outcomes with DocSlicer

See how DocSlicer's structure-aware chunking delivers more accurate answers than simple chunking methods.

Apple 10-K Filing

SEC filing with financial statements and risk factors

Ask a Question

Compare Simple Chunking vs DocSlicer responses