You uploaded a 200-page PDF to your custom GPT, asked it a specific question, and got a vague answer that ignores half your document. The problem isn't ChatGPT — it's how your file is structured. When you feed a large, unprocessed PDF into a custom GPT's knowledge base, the retrieval system has to guess which section is relevant, and it frequently guesses wrong.
Chunking — splitting your document into smaller, semantically meaningful pieces — is the single most impactful thing you can do to improve your custom GPT's accuracy.
## Why Raw PDFs Fail in Custom GPTs
When you upload a file to a ChatGPT custom GPT, OpenAI's retrieval system indexes the content and pulls relevant sections based on the user's query. The system works best when each chunk of text is:
- Focused on one topic — a chunk about "refund policy" shouldn't also contain "shipping rates"
- Small enough to fit in the context window alongside other retrieved chunks
- Clean enough to parse — no garbled headers, page numbers, or broken table formatting
Raw PDFs violate all three. A 200-page PDF gets treated as one massive blob. The retrieval system might pull a section that starts mid-sentence because the PDF extractor split at a page boundary. Headers, footers, and page numbers get mixed into the text. Tables turn into nonsensical strings of characters.
The fix is straightforward: chunk your document into clean, focused text segments before uploading.
## Chunking Strategies That Actually Work
Not all chunking approaches are equal. Here are the three most common strategies, ranked by effectiveness for custom GPTs.
### 1. Section-Based Chunking (Best for Most Documents)
Split your document at natural section boundaries — chapter headings, H2/H3 headers, or topic transitions. Each chunk becomes a self-contained unit that answers one type of question.
This works best because custom GPT retrieval is essentially a search problem. When a user asks "What's the refund policy?", retrieval needs to find the chunk labeled "Refund Policy" — not page 47 of a 200-page document.
```python
import re

def chunk_by_sections(text: str, max_chunk_size: int = 2000) -> list[str]:
    """Split text at markdown-style headers."""
    sections = re.split(r'\n(?=#{1,3}\s)', text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # If a section is too long, split at paragraph boundaries
        if len(section) > max_chunk_size:
            paragraphs = section.split('\n\n')
            current_chunk = ""
            for para in paragraphs:
                if len(current_chunk) + len(para) > max_chunk_size and current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = para
                else:
                    current_chunk += "\n\n" + para
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
        else:
            chunks.append(section)
    return chunks
```

### 2. Fixed-Size Chunking with Overlap
Split text into chunks of a fixed token or character count, with a small overlap between consecutive chunks so context isn't lost at boundaries.
This is the fallback when your document doesn't have clear section headers — think OCR'd scans, legal documents with minimal formatting, or raw text dumps.
```python
def chunk_fixed_size(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind('. ')
            if last_period > chunk_size * 0.7:
                end = start + last_period + 1
                chunk = text[start:end]
        chunks.append(chunk.strip())
        start = end - overlap
    return chunks
```

### 3. Semantic Chunking (Advanced)
Use an embedding model to detect topic shifts and split at semantic boundaries. This produces the highest-quality chunks but requires more setup — you need an embedding API and some clustering logic.
For most custom GPT use cases, section-based chunking gets you 90% of the way there. Semantic chunking matters more when you're building a production RAG pipeline with thousands of documents.
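To make the idea concrete, here is a minimal sketch of semantic chunking. A production version would call a real embedding API and compare embedding vectors; as a self-contained stand-in, this sketch uses word-overlap cosine similarity between adjacent sentences, and the function names and the `threshold` parameter are illustrative, not from any particular library.

```python
import math
import re
from collections import Counter

def _vector(sentence: str) -> Counter:
    """Crude bag-of-words stand-in for a real embedding."""
    return Counter(re.findall(r'\w+', sentence.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk_semantic(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences drop below a similarity threshold."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    if not sentences or sentences == ['']:
        return []
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if _cosine(_vector(prev), _vector(sent)) < threshold:
            chunks.append(' '.join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(' '.join(current))
    return chunks
```

Swapping `_vector` and `_cosine` for real embedding calls turns this into the full technique; the split-on-dissimilarity loop stays the same.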
## The Right Chunk Size for Custom GPTs
OpenAI hasn't published exact retrieval window sizes for custom GPTs, but based on testing, these guidelines hold:
- Target 800–2,000 characters per chunk (roughly 200–500 tokens)
- Never exceed 4,000 characters — larger chunks reduce retrieval precision
- Minimum 200 characters — smaller chunks lack enough context to be useful
- Include the section title at the top of each chunk so retrieval has a clear signal
If your chunks are too large, the retrieval system pulls in too much irrelevant text alongside the relevant answer. Too small, and the system might not retrieve enough context to form a coherent response.
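The guidelines above are easy to enforce mechanically. This sketch (the function name and thresholds mirror the numbers in this section, not any official OpenAI limit) prepends the section title to each chunk and warns when a chunk falls outside the recommended range:

```python
def prepare_chunks(chunks: list[str], section_title: str) -> list[str]:
    """Prepend the section title to each chunk and warn on out-of-range sizes."""
    prepared = []
    for i, chunk in enumerate(chunks):
        # Give retrieval a clear signal by labeling every chunk
        if chunk.startswith(section_title):
            labeled = chunk
        else:
            labeled = f"{section_title}\n\n{chunk}"
        # 200 / 4,000 character bounds from the guidelines above
        if not 200 <= len(labeled) <= 4000:
            print(f"warning: chunk {i} is {len(labeled)} chars (target 800-2,000)")
        prepared.append(labeled)
    return prepared
```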
## Cleaning Your PDF Before Chunking
Chunking a messy PDF just gives you messy chunks. Before you split, clean the extracted text:
- Strip headers and footers — page numbers, document titles, and running headers add noise
- Fix line breaks — PDF extractors often insert hard line breaks mid-sentence at page-width boundaries
- Normalize whitespace — remove double spaces, tab characters, and trailing whitespace
- Handle tables — convert tables to a readable text format (markdown tables or key-value pairs) rather than letting the extractor produce column-jumbled text
- Remove boilerplate — table of contents, copyright notices, and blank pages don't need to be in your knowledge base
```python
import re

def clean_pdf_text(text: str) -> str:
    """Clean common PDF extraction artifacts."""
    # Remove page numbers on their own line first, before line-merging
    # absorbs them into surrounding sentences ([ \t] instead of \s so
    # paragraph breaks aren't collapsed)
    text = re.sub(r'\n[ \t]*\d+[ \t]*\n', '\n', text)
    # Fix hyphenated line breaks
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Merge broken lines (not paragraph breaks)
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # Normalize whitespace
    text = re.sub(r' {2,}', ' ', text)
    return text.strip()
```

## Putting It All Together
Here's the complete workflow for preparing a PDF for a custom GPT:
1. Extract text from the PDF using a tool that handles formatting (not just raw text extraction)
2. Clean the extracted text — fix line breaks, strip headers/footers, normalize whitespace
3. Chunk using section-based splitting with a max size of ~2,000 characters
4. Review a few chunks manually to verify quality
5. Save each chunk as a separate .txt file, or combine into one file with clear separators
6. Upload to your custom GPT's knowledge base
For step 5, a common format is one .txt file per chunk, named descriptively:
refund-policy.txt
shipping-rates.txt
product-specifications.txt
account-management.txt
Or a single file with separators:
=== Refund Policy ===
[chunk content]
=== Shipping Rates ===
[chunk content]
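Both output formats are a few lines of code. This sketch writes either one; the function names and the slug rules are illustrative choices, not a required convention:

```python
import re
from pathlib import Path

def slugify(title: str) -> str:
    """Turn a section title into a filesystem-safe name like 'refund-policy'."""
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

def write_chunk_files(chunks: dict[str, str], out_dir: str) -> list[str]:
    """One descriptively named .txt file per chunk (title -> chunk text)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for title, body in chunks.items():
        name = f"{slugify(title)}.txt"
        (out / name).write_text(body.strip() + "\n", encoding="utf-8")
        names.append(name)
    return names

def write_single_file(chunks: dict[str, str], path: str) -> None:
    """All chunks in one file, separated by === Title === markers."""
    parts = [f"=== {title} ===\n{body.strip()}" for title, body in chunks.items()]
    Path(path).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
```

Per-chunk files make it easier to update one topic later without re-uploading everything; the single-file format keeps the knowledge base tidy when you have many small sections.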
## Skip the Manual Work
If this sounds like a lot of steps for every document you want to add to a custom GPT, it is. That's exactly the problem Knowledge Builder Pro solves.
Upload your PDF (or DOCX, CSV, HTML, Markdown), and Knowledge Builder Pro extracts the text, cleans formatting artifacts, chunks it by topic with the right size for AI retrieval, and outputs clean .txt files ready for your custom GPT, Claude project, or RAG pipeline. The entire process takes seconds instead of the hour-plus you'd spend doing it by hand.
The chunking and cleaning logic is purpose-built for AI knowledge bases — not just generic text extraction. That difference shows up directly in how well your custom GPT answers questions.