How to Chunk Documents for a RAG Pipeline

Knowledge Builder Pro Team · 8 min read

Introduction

Most RAG failures aren't a model problem. They're a chunking problem. The retriever returns the wrong slice of text because chunks were sliced at arbitrary character boundaries, broke a thought in half, or smashed unrelated paragraphs together — and the LLM then answers from rubble.

Knowing how to chunk documents for a RAG pipeline is the difference between a retriever that surfaces the right passage on the first hit and one that needs three follow-up queries to land. The decisions are simple in theory and brutally consequential in practice.

What chunking does (and why size matters)

A retrieval-augmented generation pipeline embeds chunks of text into a vector database, retrieves the top-k closest matches to a user query, and feeds those chunks into the LLM's context window. Chunks are the unit of retrieval. Whatever you put inside a chunk is the smallest answer your system can return.

The embedding model doesn't read your document. It reads each chunk independently, projects it into a vector, and the retriever ranks vectors. A chunk of 800 tokens covering three different topics produces a blurry average embedding — useful for nothing specific. A chunk of 60 tokens of a single tight thought produces a sharp one.
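
To see this concretely, you can embed a focused chunk and a multi-topic chunk and score both against the same query. A minimal sketch using sentence-transformers and the same embedding model as the pipeline example later in this article (the sample texts are invented; exact scores will vary, but the focused chunk should score higher):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

query = "How do I reset my password?"
focused = "To reset your password, open Settings, choose Security, and click Reset Password."
blurry = ("To reset your password, open Settings. Our office closes on public holidays. "
          "Invoices go out on the first of the month. The API rate limit is 100 requests per minute.")

q, f, b = model.encode([query, focused, blurry], normalize_embeddings=True)
print(util.cos_sim(q, f).item())  # sharp, single-topic chunk
print(util.cos_sim(q, b).item())  # blurry multi-topic chunk scores lower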

Chunk size is the first decision and the one most teams get wrong by copying the default in their RAG framework's tutorial. Defaults vary wildly:

  • LangChain's RecursiveCharacterTextSplitter, as shown in its docs: 1000 chars, 200 overlap
  • LlamaIndex's SentenceSplitter: 1024 tokens, 200 overlap
  • Pinecone tutorial: 256 tokens
  • Many production pipelines: 200–400 tokens with 10–20% overlap

A heuristic that actually holds up:

  1. Find the smallest unit a user would ask about — a FAQ entry, a function definition, a paragraph in a manual. That's your floor.
  2. Find the largest single coherent block in your source — a procedure, a clause, a complete code example. That's your ceiling.
  3. Pick a size that fits one of those units cleanly, not the average of them. The sketch below shows one way to measure those units.
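
To ground steps 1 and 2 in data rather than guesswork, measure the token lengths of your corpus's natural units. A minimal sketch using tiktoken (the cl100k_base encoding and the file name are assumptions; match the tokenizer to your embedding model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("manual.md") as f:  # hypothetical source file
    paragraphs = [p for p in f.read().split("\n\n") if p.strip()]

lengths = sorted(len(enc.encode(p)) for p in paragraphs)
print("floor (10th percentile):", lengths[len(lengths) // 10])
print("ceiling (90th percentile):", lengths[(9 * len(lengths)) // 10])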

For most knowledge bases, 300–500 tokens is the sweet spot. Below 200 tokens you lose enough context that retrieval becomes noisy. Above 800 tokens, retrieval precision drops because each chunk's embedding gets blurry across multiple topics. The whole point of learning how to chunk documents for a RAG pipeline is to dial that size to your domain rather than copy a tutorial.

Fixed-size vs semantic chunking (and why overlap matters)

Two camps:

Fixed-size chunking splits at character or token counts (e.g., every 400 tokens with 50-token overlap). Easy to implement. Predictable. Brutally bad at respecting where ideas actually start and end. A heading lands at the end of one chunk; the body of that section lands in the next. The retriever sees the heading without the body and fails the query.

Semantic chunking splits at meaningful boundaries — paragraph breaks, section headings, topic shifts detected by embedding similarity drops. More work to implement. Far better retrieval quality on real documents. Tools that do this well: spaCy's sentence segmentation as a base plus a topic-similarity check between adjacent chunks; LlamaIndex's SemanticSplitterNodeParser; or your own pipeline that respects markdown structure if your source is already structured.

If you're starting from scratch, default to semantic chunking. The marginal cost is small; the retrieval quality jump is large. If your source documents are noisy PDFs with broken layouts, clean and structure them first: semantic chunking on garbage produces semantically bounded garbage.
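
If you want the similarity-drop approach without pulling in a framework, the core mechanic fits in a short function. A rough sketch (the 0.6 threshold and the naive regex sentence split are assumptions to tune; swap in spaCy segmentation for anything production-grade):

import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def semantic_chunks(text, threshold=0.6):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # a similarity drop between adjacent sentences marks a topic shift
        if util.cos_sim(embs[i - 1], embs[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks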

Then add overlap. Overlap is the cheapest reliability trick in RAG. Each chunk shares a few sentences with its neighbors so an answer that straddles a boundary still appears intact in at least one chunk.

  • 10–20% of chunk size is the standard starting point. For 400-token chunks, that's 40–80 tokens.
  • Higher overlap (up to 25%) helps when documents are dense and ideas span multiple sentences.
  • Lower overlap (5%) is fine when you're chunking on hard structural boundaries like one chunk per FAQ Q&A — boundaries already match the answer unit.

Zero overlap is a trap unless you've chunked at semantic units that you're sure contain whole answers. The first time a user asks something that spans a boundary, they get a nonsense answer.
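
For reference, the fixed-size-with-overlap mechanics are just a sliding window over tokens. A minimal sketch (token counts via tiktoken; the defaults mirror the 400-token, 15% numbers used elsewhere in this article):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fixed_chunks(text, size=400, overlap=60):
    tokens = enc.encode(text)
    step = size - overlap  # each window starts overlap tokens before the previous one ends
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]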

How to chunk documents for a RAG pipeline: a working example

Here's a minimal pipeline that respects the principles above:

from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
 
# Step 1: extract text, page by page, dropping headers/footers
reader = PdfReader("manual.pdf")
pages = []
for page in reader.pages:
    text = page.extract_text()
    text = strip_headers_and_footers(text)  # custom pass on repeated lines; sketch below
    pages.append(text)
 
full_text = "\n\n".join(pages)
 
# Step 2: split on semantic boundaries first, fall back to characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1600,        # ~400 tokens at 4 chars/token
    chunk_overlap=240,      # 15% overlap
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(full_text)
 
# Step 3: embed and store
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(chunks, normalize_embeddings=True)
# upsert (chunk_id, chunk_text, embedding, metadata) into your vector DB

The separators list does most of the work. It tells the splitter to prefer markdown headings, then paragraph breaks, then sentence breaks, before falling back to character splits. That ordering is the difference between chunks that respect document structure and chunks that don't.

If your source isn't already structured — raw scanned PDFs, exports from older systems, mixed Word docs — the cleaning step is where most pipelines die. Strip running headers, repeated footers, page numbers, and table-of-contents detritus before you split. Knowledge Builder Pro handles that cleanup automatically and outputs chunked files ready to embed, which saves a lot of the preprocessing work when you'd rather not maintain a custom pipeline.
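
If you do roll your own, the strip_headers_and_footers call in the example above is a stub you'd need to fill in. A minimal version detects lines that repeat across pages and drops them, written here as a corpus-level pass (so the extraction loop would collect raw page texts first, then clean them all at once); the 0.5 fraction is an assumption to tune:

import re
from collections import Counter

def strip_headers_and_footers_all(pages, min_fraction=0.5):
    # lines that appear on at least half the pages are treated as running headers/footers
    counts = Counter()
    for text in pages:
        for line in {l.strip() for l in text.splitlines()}:
            counts[line] += 1
    repeated = {l for l, n in counts.items() if l and n >= min_fraction * len(pages)}
    page_number = re.compile(r"^\s*\d+\s*$")  # bare page numbers
    return ["\n".join(l for l in text.splitlines()
                      if l.strip() not in repeated and not page_number.match(l))
            for text in pages]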

Common chunking mistakes that wreck retrieval

Three mistakes show up in nearly every broken RAG pipeline.

Chunking before cleaning. Splitting a PDF that still contains running headers, footers, page numbers, and TOC artifacts means every chunk starts with junk. The embedding picks up that noise and your retriever ranks the wrong passages first. Strip layout artifacts before you split.

Stripping all structure to plain text. Markdown headings, lists, and code fences are signal. They tell the embedding model where ideas start and stop. Flatten them and you've thrown away free structure. If your source is markdown or has headings, keep them.

Treating tables and code like prose. A table chunked at character boundaries becomes gibberish in vector space. Either keep tables intact as a single chunk with a short text description prepended, or convert them into structured rows. Code blocks should always be kept whole — chopping a function in half is the fastest path to hallucinated answers.
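
If your source is markdown, one cheap way to honor that rule is to pull fenced code blocks out before splitting and store each one as its own chunk. A rough sketch (the placeholder token is a made-up convention; tables can get the same treatment with a table-matching pattern):

import re

FENCE = re.compile(r"```.*?```", re.DOTALL)

def split_out_code_blocks(markdown_text):
    code_chunks = FENCE.findall(markdown_text)        # each fenced block becomes one whole chunk
    prose = FENCE.sub("[CODE BLOCK]", markdown_text)  # placeholder keeps surrounding prose readable
    return prose, code_chunks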

A fourth, less obvious mistake: not measuring. Build an eval set of 30–50 real questions paired with the expected source passage, then run your retriever and check top-3 hit rate. If your hit rate is below 80%, the answer is almost always in the chunking, not the model. Adjust size, overlap, or boundary strategy and re-measure.
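
A minimal harness for that check, reusing the model and embeddings variables from the pipeline example above; the eval-set format, a list of (question, expected chunk index) pairs, is a hypothetical convention:

import numpy as np

def top_k_hit_rate(eval_set, model, chunk_embeddings, k=3):
    hits = 0
    for question, expected_idx in eval_set:
        q = model.encode([question], normalize_embeddings=True)[0]
        scores = chunk_embeddings @ q  # cosine similarity, since both sides are normalized
        top_k = np.argsort(scores)[::-1][:k]
        hits += int(expected_idx in top_k)
    return hits / len(eval_set)

# hit_rate = top_k_hit_rate(eval_set, model, embeddings, k=3)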

When to skip chunking in your RAG pipeline entirely

Some RAG-shaped problems don't actually need a RAG pipeline. If your full corpus fits inside the model's context window with room left for the query and the answer (Claude 3 Opus offers 200k tokens; Gemini 1.5 Pro, over 1M), and latency and cost aren't concerns, feed everything in. Chunking adds engineering surface area. If you don't need it, don't build it.

The threshold flips when:

  • Corpus exceeds the context window
  • Cost-per-query matters at scale
  • You need filtered retrieval — only answer from this folder, this client's docs
  • You want citation-grade traceability ("answer came from chunk 42")

For ChatGPT custom GPTs and Claude Projects, you don't control the chunking — the platform does it. Your job is to give it cleanly structured source files. Send a 250-page PDF with broken tables and the platform's retriever will struggle no matter how good your prompt is. Send pre-cleaned, semantically chunked files and retrieval gets sharp.

Wrapping Up

Chunking is the most quietly important decision in a RAG pipeline. Knowing how to chunk documents for a RAG pipeline well comes down to four things: pick a size that matches the answer unit in your domain, prefer semantic boundaries over fixed splits, add modest overlap as cheap insurance, and clean your source before you split. Get those right and your retriever does its job. Get them wrong and no model in the world will compensate.

If you'd rather skip building the preprocessing yourself, Knowledge Builder Pro cleans your documents, chunks them on semantic boundaries, and outputs files ready for ChatGPT, Claude, or your own vector pipeline — all in-memory, no files stored. Run it on a real document and compare your retrieval hit rate before and after.

Stop wrestling with messy documents

Knowledge Builder Pro converts your PDFs, DOCX, and other files into clean, chunked knowledge base files optimized for ChatGPT, Claude, and RAG pipelines.
