The file format you upload to your custom GPT knowledge base matters more than most builders realize. Two identical documents — same content, same length — can produce wildly different retrieval accuracy depending on whether you upload a .pdf, .txt, or .docx. The difference comes down to how OpenAI's retrieval system parses and indexes each format, and some formats make that job significantly easier than others.
If you're getting vague or incorrect answers from your custom GPT and the content is definitely in your files, the format is the first thing to check.
## How Custom GPT Retrieval Actually Handles Each Format
When you upload a file to a custom GPT, OpenAI's system converts it into text, splits that text into chunks, embeds those chunks as vectors, and stores them in a vector store. When a user asks a question, the system finds the most relevant chunks by comparing the question's embedding against the stored vectors.
The critical step is the first one: text conversion. If the parser produces clean, well-structured text, the chunks make sense and the embeddings are accurate. If the parser produces garbage — broken tables, interleaved columns, inline headers — the chunks are noisy and retrieval degrades.
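To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. The window and overlap sizes are illustrative assumptions; OpenAI does not document its internal values.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    The size/overlap values are illustrative assumptions; OpenAI's
    actual chunking parameters are internal and undocumented.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# A 2000-character document yields three overlapping chunks.
print(len(chunk_text("x" * 2000)))  # 3
```

The overlap matters: it keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. Noisy input (interleaved columns, inline headers) defeats this, because the noise lands inside every window it touches.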
Here is how each supported format performs at that conversion step:
| Format | Parse Quality | Structure Preserved | Best For |
|--------|--------------|-------------------|----------|
| .txt | Near-perfect | None (plain text) | Pre-cleaned content |
| .md | Excellent | Headings, lists, code blocks | Technical docs |
| .csv | Good | Tabular data intact | Structured datasets |
| .json | Good | Key-value pairs intact | API responses, configs |
| .docx | Good | Headings, some formatting | Word documents |
| .pdf | Variable | Often broken | Legacy docs (use with caution) |
The ranking is clear: formats that are already close to plain text parse the most reliably. The further a format is from raw text, the more opportunities for the parser to introduce errors.
## Why TXT and Markdown Win for Most Use Cases
Plain .txt files give the retrieval system exactly what it needs: clean text with no ambiguity about reading order, structure, or character encoding. There is nothing to misparse. Every character in the file is a character in the index.
Markdown (.md) adds just enough structure — headings with #, lists with -, code blocks with triple backticks — without introducing any of the visual-rendering complexity that trips up PDF extractors. If your knowledge base has natural sections or hierarchies, markdown preserves those boundaries in a way the chunking system can use.
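As a rough illustration of why those boundaries help, a splitter can cut a markdown file at its headings so each section stays intact. This is a simplified sketch, not OpenAI's actual algorithm:

```python
import re

def split_on_headings(md: str) -> list[str]:
    """Split a markdown document into sections, one per heading.

    A simplified sketch: the real chunker is more sophisticated, but
    heading markers give any splitter clean boundaries to work with.
    """
    parts = re.split(r"(?m)^(?=#{1,6} )", md)
    return [p for p in parts if p.strip()]

doc = "# Policies\nIntro text.\n## Leave\nLeave rules.\n## Expenses\nExpense rules.\n"
print(len(split_on_headings(doc)))  # 3 sections, one per heading
```

A plain .txt file offers no such markers, which is fine for pre-cleaned prose but loses the section hierarchy markdown gives you for free.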
A practical example: say you are building a custom GPT for internal company policies. Your HR team hands you a 40-page Word document. You could upload the .docx directly, but a better approach is:
- Convert the .docx to markdown (tools like Pandoc handle this in one command)
- Review the output for formatting artifacts
- Upload the clean .md file
```bash
# Convert DOCX to clean markdown
pandoc company-policies.docx -t markdown -o company-policies.md

# Quick check for formatting issues
head -100 company-policies.md
```

That 10 seconds of conversion can measurably improve the quality of answers your custom GPT returns, because every chunk the retrieval system pulls will be clean and contextually coherent.
## When PDF Is Your Only Option (and How to Minimize Damage)
Sometimes you are stuck with PDFs. Regulatory documents, academic papers, scanned forms — the original source only exists as a PDF, and recreating it in another format is not practical.
The issue with PDFs is not that they cannot work in a custom GPT. They can. The issue is that PDF parsing is inherently lossy. A PDF is a set of drawing instructions, not a text document. The parser has to reconstruct reading order, paragraph boundaries, and table structures from positioned characters on a canvas. It gets this wrong often enough that you should never assume a PDF uploaded cleanly.
Three specific things break most frequently with PDFs in custom GPTs:
1. **Tables** — Column data gets linearized into a single line. A table with "Product | Price | SKU" becomes "Product Price SKU Widget 9.99 A100" with no separators. The retrieval system cannot distinguish column values from row values.
2. **Multi-column layouts** — Text from the left column and right column gets interleaved. A paragraph from column A might have sentences from column B injected into the middle.
3. **Headers and footers** — Page numbers, document titles, and copyright notices appear inline with body text on every page. If your document is 50 pages, you have 50 instances of "Company Name | Confidential | Page X" scattered through the indexed content.
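The header/footer problem is the easiest of the three to clean up before upload. One hedged approach, assuming the header text repeats verbatim on every page, is to drop any line that recurs many times in the extracted text:

```python
from collections import Counter

def strip_repeated_lines(text: str, min_repeats: int = 10) -> str:
    """Drop non-blank lines that repeat many times across a document,
    which in extracted PDF text are usually per-page headers/footers.

    Fixed strings like "ACME Corp | Confidential" are caught; varying
    lines like "Page 7" need a separate pattern-based pass.
    """
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = [
        line for line in lines
        if not line.strip() or counts[line.strip()] < min_repeats
    ]
    return "\n".join(kept)

# Simulated 50-page extraction with a repeated header on every page.
pages = "\n".join(
    f"ACME Corp | Confidential\nBody text for page {i}." for i in range(50)
)
cleaned = strip_repeated_lines(pages)
print("ACME Corp | Confidential" in cleaned)  # False
```

The `min_repeats` threshold is a judgment call: set it just below your page count so genuinely repeated body lines survive.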
If you must use PDFs, the best approach is to pre-process them into clean text before uploading. Extract the text, fix any parsing artifacts, and upload the cleaned version as .txt or .md. Knowledge Builder Pro automates this entire pipeline — upload your messy PDF, get a clean, chunked file optimized for custom GPTs and AI agents in seconds.
## CSV and JSON: The Right Choice for Structured Data
If your knowledge base includes structured data — product catalogs, FAQs, pricing tables, API references — .csv and .json formats outperform text-based formats because they preserve the relationship between fields.
Consider a product catalog. In a .txt file, a product entry might look like:
```text
Widget Pro - $29.99 - Available in Red, Blue, Green - SKU: WP-100
```
The retrieval system has to infer where the product name ends and the price begins. In a .csv, the structure is explicit:
```csv
name,price,colors,sku
Widget Pro,29.99,"Red, Blue, Green",WP-100
```

For FAQ-style knowledge bases, .json works well because each question-answer pair is a self-contained object that maps naturally to a single chunk:
```json
[
  {
    "question": "What is the return policy?",
    "answer": "Returns accepted within 30 days of purchase with original receipt. Opened software is non-refundable."
  },
  {
    "question": "Do you offer international shipping?",
    "answer": "Yes, to 45 countries. Shipping rates calculated at checkout based on weight and destination."
  }
]
```

Each JSON object becomes a clean, self-contained chunk. The retrieval system can match a user question directly to the right answer without pulling in unrelated content from adjacent entries.
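To see why this maps so cleanly, here is a short sketch that turns each FAQ object into one retrieval-ready chunk. The "Q:/A:" formatting is one reasonable convention, not a required one:

```python
import json

faq_json = """[
  {"question": "What is the return policy?",
   "answer": "Returns accepted within 30 days of purchase."},
  {"question": "Do you offer international shipping?",
   "answer": "Yes, to 45 countries."}
]"""

# One self-contained chunk per object: the question and its answer
# always stay together, so retrieval never splits a pair.
chunks = [f"Q: {e['question']}\nA: {e['answer']}" for e in json.loads(faq_json)]
print(len(chunks))  # 2
```

Contrast this with prose: a Q&A written as flowing paragraphs can get split mid-answer by a fixed-size chunker, leaving the question in one chunk and half the answer in another.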
## A Decision Framework for Choosing Your Format
Rather than guessing, use this checklist to pick the best file format for your specific custom GPT knowledge base:
1. **Is the content already in plain text or markdown?** Upload it as-is. Do not convert to PDF or DOCX — you are only adding parsing risk.
2. **Is it a Word document?** Convert to markdown with Pandoc. Review the output for 2 minutes. Upload the .md file.
3. **Is it structured data (products, FAQs, tables)?** Use .csv for tabular data, .json for nested or key-value data.
4. **Is it a PDF with a simple text layout (single column, no tables)?** Try uploading directly. Test with 5-10 specific questions. If the answers are accurate, the parse was clean enough.
5. **Is it a PDF with a complex layout (tables, columns, scanned pages)?** Do not upload directly. Extract and clean the text first, either manually or with a preprocessing tool. Upload the cleaned version as .txt or .md.
6. **Do you have multiple small files on the same topic?** Combine them into a single file per topic area. Fewer, larger files with clear internal structure outperform many tiny files, because each file carries metadata overhead and the 20-file limit is a hard cap.
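The combining step can be scripted in a few lines. This sketch (the directory name and heading convention are assumptions) merges small markdown files into one per-topic file, promoting each filename to a heading so the internal structure stays visible to the chunker:

```python
from pathlib import Path

def combine_files(paths: list[Path], out: Path) -> None:
    """Merge several small markdown files into one, with each
    filename promoted to a top-level heading so section
    boundaries survive for the chunker."""
    sections = [
        f"# {p.stem}\n\n{p.read_text().strip()}\n" for p in sorted(paths)
    ]
    out.write_text("\n".join(sections))

# Hypothetical usage: merge all per-policy files into one topic file.
# combine_files(list(Path("policies").glob("*.md")), Path("policies-combined.md"))
```

Sorting the paths keeps the output deterministic, so re-running the merge after editing one source file produces a stable diff.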
The file format decision takes 2 minutes and can be the difference between a custom GPT that feels like it actually read your documents and one that hallucinates because it pulled a garbled table as its source of truth.
## Wrapping Up
The best file format for a custom GPT knowledge base is whichever format produces the cleanest text after parsing. For most builders, that means .txt or .md for prose content, .csv for tabular data, and .json for structured Q&A or reference data. PDFs work when the layout is simple, but anything with tables, columns, or scanned pages needs preprocessing before upload.
If you want to skip the manual conversion and cleanup, Knowledge Builder Pro handles the entire pipeline automatically — upload your files in any format and get optimized, chunked output ready for your custom GPT or Claude Project. Start your 7-day free trial and test it with your own documents.