How I Built a RAG System That Processes 10K+ Documents Daily
A deep dive into the architecture, challenges, and optimizations that went into building a production-ready RAG system for a financial services client.
When a financial services firm asked me to build a system that could answer complex regulatory questions by searching through their internal knowledge base of 10,000+ documents, I knew a standard keyword search wouldn't cut it. They needed semantic understanding — the ability to find answers even when the question and the source document used completely different words.
This is the story of how I designed, built, and deployed that system using Retrieval-Augmented Generation (RAG).
Why RAG Was the Only Real Option
The client initially asked for a fine-tuned model. After a discovery call, I identified three dealbreakers with fine-tuning:
1. Their documents update weekly. Fine-tuning requires retraining. That's expensive and slow.
2. They need source attribution. When the system says "the max exposure limit is $2M," the compliance team needs to see exactly which document that came from.
3. Budget constraints. Fine-tuning GPT-4 on 10K documents would cost thousands per training run.
RAG solved all three. The knowledge stays external, sources are always traceable, and the system only calls the LLM at query time — not during ingestion.
The Architecture
Here's the pipeline I settled on after two weeks of prototyping:
Data Ingestion Layer:
- A Python service watches designated folders and S3 buckets for new PDFs, DOCX files, and internal wiki exports.
- Each file is parsed using a combination of
pdfplumber(for structured PDFs with tables) andunstructured(for messy scanned documents with OCR fallback). - Raw text is cleaned: headers/footers stripped, page numbers removed, tables converted to markdown format.
Chunking Strategy:
- I tested three approaches: fixed-size (500 tokens), sentence-based, and semantic chunking.
- Winner: Semantic chunking using paragraph boundaries with a 400-token target and 80-token overlap. This preserved context far better than arbitrary splits.
- Each chunk carries metadata: source document name, page number, section heading, and ingestion timestamp.
Embedding Pipeline:
- Model:
text-embedding-3-largefrom OpenAI (3072 dimensions). I testedall-MiniLM-L6-v2first but the accuracy drop on financial jargon was measurable — 8% worse on our evaluation set. - Embeddings are generated in batches of 100 chunks to stay within rate limits.
- Total embedding cost: roughly $12/month for 10K documents with weekly re-ingestion.
Vector Storage:
- Pinecone on the Standard plan. I chose Pinecone over pgvector here because the client needed sub-200ms query latency at scale, and Pinecone's managed infrastructure handles that without any tuning.
- Each vector is stored with its metadata for filtered retrieval (e.g., "only search documents from the compliance department").
Retrieval + Generation:
- At query time, the user's question is embedded and the top 5 chunks are retrieved via cosine similarity.
- A reranker (
Cohere Rerank) rescores the chunks to push the most relevant ones to the top. This step alone improved answer accuracy by 12%. - The final prompt is assembled: system instructions + retrieved chunks + user question. Sent to
gpt-4-turbowith temperature 0.1 for factual grounding.
The Challenges I Didn't Expect
1. Table extraction was brutal. Financial documents are full of tables. When I chunked them as plain text, the LLM couldn't correlate column headers with cell values. Fix: I converted tables to markdown format during preprocessing, which preserved structure beautifully.
2. Duplicate content across documents. Many documents contained identical boilerplate sections. This polluted retrieval results. Fix: I added a deduplication step using MinHash to detect near-duplicate chunks and keep only the most recent version.
3. Long documents with shifting topics. A single 80-page PDF might cover ten different topics. If a chunk from page 3 (about risk management) was adjacent to a chunk from page 4 (about HR policies), the overlap window would blend them. Fix: I added section-aware splitting that respects heading boundaries.
The Results
After 6 weeks of development and 2 weeks of user testing:
- 98% factual accuracy on a 200-question evaluation set (manually verified by the compliance team)
- Query latency under 400ms end-to-end (embedding → retrieval → generation)
- 70% reduction in analyst research time — what used to take 30 minutes of manual document searching now takes one question
- Source attribution on every answer — the system returns the exact document name, page number, and relevant passage
Key Takeaways
1. Chunking strategy matters more than model choice. Switching from fixed-size to semantic chunking improved accuracy by 15% — more than any model upgrade.
2. Always add a reranker. The initial retrieval gets you in the right neighborhood. The reranker gets you to the right house.
3. Metadata is your superpower. Storing section headings, document types, and dates with each chunk enables powerful filtered retrieval.
4. Test with real users early. My evaluation set missed several question patterns that actual analysts asked. User testing caught issues I never would have found in isolation.
Want a System Like This?
If you're sitting on a mountain of documents and your team is still Ctrl+F-ing through PDFs, let's talk. I build production RAG systems that turn your existing knowledge into an intelligent, searchable assistant.
[Try my live RAG demo →](/demo) or [get in touch →](/contact) to discuss your project.
Related Articles
RAG vs Fine-Tuning: A Practical Guide for Business Owners
When should you use RAG? When is fine-tuning better? This guide breaks down the trade-offs with real-world examples and cost analysis.
Vector Databases Compared: Pinecone vs Weaviate vs pgvector
A technical comparison of the leading vector databases for RAG applications, with benchmarks and use case recommendations.
Why Most Chatbots Fail (And How to Build Ones That Don't)
The common pitfalls in chatbot development and the architectural decisions that separate successful implementations from failed ones.
Ready to Build Your AI System?
I build production RAG systems, intelligent chatbots, and AI automation pipelines. Let's turn your data into decisions.