Introduction
Claude Projects put your uploaded documents directly into the model's context window at every turn — no vector search, no chunking heuristics, no retrieval layer standing between Claude and your files. That sounds forgiving until you realize a single messy PDF with running headers, scanned pages, and twenty pages of boilerplate eats your entire token budget before Claude has room to think. Prepare documents for Claude Projects the way you'd prepare a brief for a senior analyst: strip the noise, structure the signal, and make every token pull its weight.
This guide covers exactly what to do before you click upload — the OCR step most people skip, the cleanup passes that reclaim thousands of tokens, and the structure tweaks that let Claude find the right section on the first try.
How Claude Projects Handle Your Files (and Why It Matters)
Claude Projects work differently from ChatGPT custom GPTs. Custom GPTs chunk your files, embed the chunks as vectors, and retrieve a handful of the best-matching chunks at query time. Claude Projects do something closer to "attach every file to every message": your docs land in the context window directly, up to the project's token limit.
That architectural difference changes what document prep means. For ChatGPT, the enemy is retrieval failure — the wrong chunks get pulled. For Claude, the enemy is token bloat — your useful content gets crowded out by PDF artifacts, duplicated headers, OCR garbage, and stale sections you forgot to strip. A 40-page PDF might actually be 12 pages of real content wrapped in 28 pages of repeating boilerplate, legal notices, and page furniture.
Claude Sonnet ships with a 200K-token context window, and Projects give you a generous slice of that for your files. It feels like a lot until you upload five PDFs from a messy shared drive and watch Claude start saying "the document does not contain that information" when it obviously does. The information is there — it's just buried under 80K tokens of repeating footer text.
Why You Need to Prepare Documents for Claude Projects at All
Claude is good at parsing messy input. It's not magic. Three failure modes repeat constantly in projects that skip document prep:
- Scanned PDFs upload as images without text, and Claude "sees" a blank document
- Repeating headers and footers pollute every page, tilting Claude's attention toward the wrong signal
- Long exported wiki pages include nav menus and "last edited by" metadata on every entry, burning tokens on noise
Each of these is fixable in under five minutes per file. The cost of skipping prep is a project that answers confidently about the boilerplate and vaguely about your actual content.
Step-by-Step: Prepare Documents for Claude Projects
Step 1: Confirm the PDF is real text, not a scan
Open your PDF and try to select a single paragraph. If the selection highlights a rectangular image block instead of individual words, the PDF is a scan. Claude will see nothing useful. Run OCR before upload — Adobe Acrobat's Scan & OCR tool works, and on macOS ocrmypdf input.pdf output.pdf from Homebrew handles it in one command. Repeat this check for every file in the stack. Scanned contracts and old reports are the most common offenders.
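If you have a whole folder to check, a short script beats opening files one by one. Here's a minimal sketch using the pypdf package (any PDF library with text extraction would do); the 50-characters-per-page threshold is a rough heuristic, not a standard:

```python
# Flag PDFs that yield little or no extractable text per page;
# those are almost always scans that need OCR before upload.
# Requires: pip install pypdf. The threshold is a rough heuristic.
from pathlib import Path
from pypdf import PdfReader

def looks_scanned(path, min_chars_per_page=50):
    reader = PdfReader(path)
    chars = sum(len(page.extract_text() or "") for page in reader.pages)
    return chars / max(len(reader.pages), 1) < min_chars_per_page

for pdf in sorted(Path(".").glob("*.pdf")):
    if looks_scanned(pdf):
        print(f"{pdf.name}: likely a scan, run OCR first")
```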
Step 2: Strip repeating headers, footers, and page numbers
Every line that repeats on every page is a tax on your token budget. A 50-page PDF with "CONFIDENTIAL — INTERNAL ONLY" at the top of each page and "Page X of 50 | Q3 2025 Report" at the bottom is wasting roughly 15 tokens per page — about 750 tokens on that one file alone, and more importantly, it's nudging Claude's attention toward the boilerplate rather than the content.
Quickest cleanup path: convert to plain text with pdftotext -layout file.pdf file.txt, then drop repeating lines with a short Python pass:
```python
# Count how often each trimmed, non-blank line occurs, then keep only
# lines that repeat fewer than three times (blank lines stay).
from collections import Counter
from pathlib import Path

lines = Path("file.txt").read_text().splitlines()
freq = Counter(l.strip() for l in lines if l.strip())
clean = [l for l in lines if not l.strip() or freq[l.strip()] < 3]
Path("file.clean.txt").write_text("\n".join(clean))
```

Anything that repeats three or more times across a document is almost always page furniture, not content.
Step 3: Convert to Markdown or clean text when possible
PDFs are a display format. They encode what a page looks like, not what it means. When Claude reads a PDF, it has to infer structure from visual layout cues that don't survive text extraction cleanly — especially for tables and multi-column pages. If you have the source document as a Word file, Google Doc, or Notion export, use that. Convert to Markdown so headings become ## Section lines, lists become - item entries, and tables keep their shape as pipe-delimited grids.
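If pandoc is installed, the conversion is one call. A minimal sketch via the pypandoc wrapper (the filenames here are placeholders):

```python
# Convert a Word source to GitHub-flavored Markdown so headings,
# lists, and tables survive as explicit structure.
# Requires pandoc on your PATH and: pip install pypandoc
import pypandoc

pypandoc.convert_file("hiring-policy.docx", "gfm",
                      outputfile="2025-hiring-policy.md")
```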
Clean Markdown files are roughly 30% smaller than the equivalent PDF in tokens and much easier for Claude to navigate by section name when you ask about a specific topic.
Step 4: Remove sections you don't actually need
Before upload, open each file and ask: does this section matter for the project's purpose? Legal disclaimers, change-log tables going back five years, appendices of raw SQL queries, org charts — all of it costs tokens and rarely gets queried. Delete ruthlessly. You can always add a file back if Claude can't answer something specific.
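If the file is already Markdown, you can script the trimming. A minimal sketch that drops whole sections by heading keyword (the filename and blocklist are illustrative; adjust both to your content):

```python
# Split a Markdown file at each ## heading and drop sections whose
# heading contains a keyword you never query.
import re
from pathlib import Path

DROP = ("disclaimer", "change log", "changelog", "appendix")

text = Path("report.md").read_text()
sections = re.split(r"(?m)^(?=## )", text)  # each piece starts at a heading

def heading(section):
    return section.splitlines()[0].lower() if section.strip() else ""

kept = [s for s in sections if not any(k in heading(s) for k in DROP)]
Path("report.trimmed.md").write_text("".join(kept))
```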
Step 5: Name files so Claude can reference them
Claude uses filenames as labels when it cites sources or decides which file to read first. final_v7_REAL_this_one.pdf tells Claude nothing. 2025-hiring-policy.md tells Claude exactly where to look when someone asks a policy question. Rename files descriptively before upload. This also makes your own future prompts easier — "check the hiring policy file" lands better than "check the big one about people."
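A minimal sketch for batch renames, using an explicit map so every change stays reviewable (the names are illustrations, not a convention you must follow):

```python
# Rename files in place from an explicit old -> new map.
from pathlib import Path

renames = {
    "final_v7_REAL_this_one.pdf": "2025-hiring-policy.pdf",
    "notes (2).md": "onboarding-checklist.md",
}
for old, new in renames.items():
    Path(old).rename(new)
```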
Step 6: Check the total token footprint
Run each cleaned file through Anthropic's token-counting API, or through any OpenAI-compatible tokenizer for a ballpark figure (the tokenizers differ, so don't treat GPT counts as exact). Keep the total for the project well under your context limit, ideally with at least 30K tokens of headroom for the actual conversation. If you're over budget, split high-value content into its own project rather than trying to cram everything into one.
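A minimal sketch of that budget check, using tiktoken as a stand-in tokenizer (the counts approximate Claude's rather than match them, which is fine for budgeting):

```python
# Sum approximate token counts across the cleaned files.
# Requires: pip install tiktoken. cl100k_base is a GPT encoding,
# used here only as a rough proxy for Claude's tokenizer.
from pathlib import Path
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
total = 0
for f in sorted(Path(".").glob("*.md")):
    n = len(enc.encode(f.read_text()))
    total += n
    print(f"{f.name}: {n:,} tokens")
print(f"project total: {total:,} tokens")
```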
Common Mistakes to Avoid
Uploading the raw PDF and hoping Claude handles it. Claude can read most PDFs, but "can read" is not the same as "reads well." The five-minute cleanup from steps one through three reliably improves answer quality on real projects. Skip it only for single-page PDFs with no scan, no headers, and no multi-column layout.
Uploading every file that might be relevant, just in case. Claude Projects reward editorial discipline. A project with 6 targeted files outperforms a project with 20 files where half are near-duplicates or tangential. If you're not sure whether a file earns its spot, leave it out — you can always add it later when you have a specific question it would answer.
Leaving exported wiki or Confluence pages as-is. Exports include navigation sidebars, breadcrumbs, "last edited" banners, and comment threads that Claude will happily quote back at you when you ask a question. Strip the wrapper before upload. Most wiki exporters have a "content only" option buried in their settings.
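When the export is HTML, a quick pass with BeautifulSoup usually does it. A minimal sketch, assuming a Confluence-style export where the article body lives in a div with id main-content (check your export's actual structure; the selector varies by tool):

```python
# Extract only the article body from an exported wiki page,
# discarding nav sidebars, breadcrumbs, and comment widgets.
# Requires: pip install beautifulsoup4
from pathlib import Path
from bs4 import BeautifulSoup

html = Path("export.html").read_text()
soup = BeautifulSoup(html, "html.parser")
body = soup.select_one("#main-content") or soup.body
Path("export.txt").write_text(body.get_text("\n", strip=True))
```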
If you want to skip the manual work, Knowledge Builder Pro converts PDFs, Word docs, and messy exports into clean, AI-ready text in one click — OCR, header stripping, and Markdown conversion all handled automatically so your Claude Projects actually have room to think.
Wrapping Up
Prepare documents for Claude Projects the same way you'd prep a reference binder for a new hire: trim the fat, structure the signal, label everything clearly. The prep work you do before upload is the single biggest lever on answer quality — bigger than the prompt, bigger than the model choice, bigger than anything you can fix after the fact. Five minutes per file, measured against a project that will answer questions reliably for months, is one of the best returns on your time in any AI workflow.
Start your 7-day free trial at knowledgebuilderpro.com and stop hand-cleaning documents before every Claude Projects upload.