You uploaded a 200-page PDF to your custom GPT, asked it a specific question, and got a vague answer that ignores half your document. The problem isn't ChatGPT — it's how your file is structured. When you feed a large, unprocessed PDF into a custom GPT's knowledge base, the retrieval system has to guess which section is relevant, and it frequently guesses wrong.
Chunking — splitting your document into smaller, semantically meaningful pieces — is the single most impactful thing you can do to improve your custom GPT's accuracy.
## Why Raw PDFs Fail in Custom GPTs
When you upload a file to a ChatGPT custom GPT, OpenAI's retrieval system indexes the content and pulls relevant sections based on the user's query. The system works best when each chunk of text is:
- Focused on one topic — a chunk about "refund policy" shouldn't also contain "shipping rates"
- Small enough to fit in the context window alongside other retrieved chunks
- Clean enough to parse — no garbled headers, page numbers, or broken table formatting
Raw PDFs violate all three. A 200-page PDF gets treated as one massive blob. The retrieval system might pull a section that starts mid-sentence because the PDF extractor split at a page boundary. Headers, footers, and page numbers get mixed into the text. Tables turn into nonsensical strings of characters.
The fix is straightforward: chunk your document into clean, focused text segments before uploading.
## Chunking Strategies That Actually Work
Not all chunking approaches are equal. Here are the three most common strategies, ranked by effectiveness for custom GPTs.
### 1. Section-Based Chunking (Best for Most Documents)
Split your document at natural section boundaries — chapter headings, H2/H3 headers, or topic transitions. Each chunk becomes a self-contained unit that answers one type of question.
This works best because custom GPT retrieval is essentially a search problem. When a user asks "What's the refund policy?", retrieval needs to find the chunk labeled "Refund Policy" — not page 47 of a 200-page document.
```python
import re

def chunk_by_sections(text: str, max_chunk_size: int = 2000) -> list[str]:
    """Split text at markdown-style headers."""
    sections = re.split(r'\n(?=#{1,3}\s)', text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # If a section is too long, split at paragraph boundaries
        if len(section) > max_chunk_size:
            paragraphs = section.split('\n\n')
            current_chunk = ""
            for para in paragraphs:
                if len(current_chunk) + len(para) > max_chunk_size and current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = para
                else:
                    current_chunk += "\n\n" + para
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
        else:
            chunks.append(section)
    return chunks
```

### 2. Fixed-Size Chunking with Overlap
Split text into chunks of a fixed token or character count, with a small overlap between consecutive chunks so context isn't lost at boundaries.
This is the fallback when your document doesn't have clear section headers — think OCR'd scans, legal documents with minimal formatting, or raw text dumps.
```python
def chunk_fixed_size(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind('. ')
            if last_period > chunk_size * 0.7:
                end = start + last_period + 1
                chunk = text[start:end]
        chunks.append(chunk.strip())
        start = end - overlap
    return chunks
```

### 3. Semantic Chunking (Advanced)
Use an embedding model to detect topic shifts and split at semantic boundaries. This produces the highest-quality chunks but requires more setup — you need an embedding API and some clustering logic.
For most custom GPT use cases, section-based chunking gets you 90% of the way there. Semantic chunking matters more when you're building a production RAG pipeline with thousands of documents.
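To make the idea concrete, here is a minimal sketch of semantic chunking. A production version would call a real embedding API and compare embedding vectors; as a self-contained stand-in, this sketch uses word-overlap cosine similarity between adjacent sentences, and the function names and the `threshold` parameter are illustrative, not from any particular library.

```python
import math
import re
from collections import Counter

def _vector(sentence: str) -> Counter:
    """Crude bag-of-words stand-in for a real embedding."""
    return Counter(re.findall(r'\w+', sentence.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk_semantic(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences drop below a similarity threshold."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    if not sentences or sentences == ['']:
        return []
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if _cosine(_vector(prev), _vector(sent)) < threshold:
            chunks.append(' '.join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(' '.join(current))
    return chunks
```

Swapping `_vector` and `_cosine` for real embedding calls turns this into the full technique; the split-on-dissimilarity loop stays the same.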
## The Right Chunk Size for Custom GPTs
OpenAI hasn't published exact retrieval window sizes for custom GPTs, but based on testing, these guidelines hold:
- Target 800–2,000 characters per chunk (roughly 200–500 tokens)
- Never exceed 4,000 characters — larger chunks reduce retrieval precision
- Minimum 200 characters — smaller chunks lack enough context to be useful
- Include the section title at the top of each chunk so retrieval has a clear signal
If your chunks are too large, the retrieval system pulls in too much irrelevant text alongside the relevant answer. Too small, and the system might not retrieve enough context to form a coherent response.
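The guidelines above are easy to enforce mechanically. This sketch (the function name and thresholds mirror the numbers in this section, not any official OpenAI limit) prepends the section title to each chunk and warns when a chunk falls outside the recommended range:

```python
def prepare_chunks(chunks: list[str], section_title: str) -> list[str]:
    """Prepend the section title to each chunk and warn on out-of-range sizes."""
    prepared = []
    for i, chunk in enumerate(chunks):
        # Give retrieval a clear signal by labeling every chunk
        if chunk.startswith(section_title):
            labeled = chunk
        else:
            labeled = f"{section_title}\n\n{chunk}"
        # 200 / 4,000 character bounds from the guidelines above
        if not 200 <= len(labeled) <= 4000:
            print(f"warning: chunk {i} is {len(labeled)} chars (target 800-2,000)")
        prepared.append(labeled)
    return prepared
```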
## Cleaning Your PDF Before Chunking
Chunking a messy PDF just gives you messy chunks. Before you split, clean the extracted text:
- Strip headers and footers — page numbers, document titles, and running headers add noise
- Fix line breaks — PDF extractors often insert hard line breaks mid-sentence at page-width boundaries
- Normalize whitespace — remove double spaces, tab characters, and trailing whitespace
- Handle tables — convert tables to a readable text format (markdown tables or key-value pairs) rather than letting the extractor produce column-jumbled text
- Remove boilerplate — table of contents, copyright notices, and blank pages don't need to be in your knowledge base
```python
import re

def clean_pdf_text(text: str) -> str:
    """Clean common PDF extraction artifacts."""
    # Remove page numbers on their own line first, before line-merging
    # absorbs them into surrounding sentences ([ \t] instead of \s so
    # paragraph breaks aren't collapsed)
    text = re.sub(r'\n[ \t]*\d+[ \t]*\n', '\n', text)
    # Fix hyphenated line breaks
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Merge broken lines (not paragraph breaks)
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # Normalize whitespace
    text = re.sub(r' {2,}', ' ', text)
    return text.strip()
```

## Putting It All Together
Here's the complete workflow for preparing a PDF for a custom GPT:
1. Extract text from the PDF using a tool that handles formatting (not just raw text extraction)
2. Clean the extracted text — fix line breaks, strip headers/footers, normalize whitespace
3. Chunk using section-based splitting with a max size of ~2,000 characters
4. Review a few chunks manually to verify quality
5. Save each chunk as a separate .txt file, or combine into one file with clear separators
6. Upload to your custom GPT's knowledge base
For step 5, a common format is one .txt file per chunk, named descriptively:
refund-policy.txt
shipping-rates.txt
product-specifications.txt
account-management.txt
Or a single file with separators:
=== Refund Policy ===
[chunk content]
=== Shipping Rates ===
[chunk content]
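Both output formats are a few lines of code. This sketch writes either one; the function names and the slug rules are illustrative choices, not a required convention:

```python
import re
from pathlib import Path

def slugify(title: str) -> str:
    """Turn a section title into a filesystem-safe name like 'refund-policy'."""
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

def write_chunk_files(chunks: dict[str, str], out_dir: str) -> list[str]:
    """One descriptively named .txt file per chunk (title -> chunk text)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for title, body in chunks.items():
        name = f"{slugify(title)}.txt"
        (out / name).write_text(body.strip() + "\n", encoding="utf-8")
        names.append(name)
    return names

def write_single_file(chunks: dict[str, str], path: str) -> None:
    """All chunks in one file, separated by === Title === markers."""
    parts = [f"=== {title} ===\n{body.strip()}" for title, body in chunks.items()]
    Path(path).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
```

Per-chunk files make it easier to update one topic later without re-uploading everything; the single-file format keeps the knowledge base tidy when you have many small sections.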
## Skip the Manual Work
If this sounds like a lot of steps for every document you want to add to a custom GPT, it is. That's exactly the problem Knowledge Builder Pro solves.
Upload your PDF (or DOCX, CSV, HTML, Markdown), and Knowledge Builder Pro extracts the text, cleans formatting artifacts, chunks it by topic with the right size for AI retrieval, and outputs clean .txt files ready for your custom GPT, Claude project, or RAG pipeline. The entire process takes seconds instead of the hour-plus you'd spend doing it by hand.
The chunking and cleaning logic is purpose-built for AI knowledge bases — not just generic text extraction. That difference shows up directly in how well your custom GPT answers questions.