Build LogDec 15, 202412 min read

How I Built a RAG System That Processes 10K+ Documents Daily

A deep dive into the architecture, challenges, and optimizations that went into building a production-ready RAG system for a financial services client.

RAGVector DatabaseArchitecturePython

When a financial services firm asked me to build a system that could answer complex regulatory questions by searching through their internal knowledge base of 10,000+ documents, I knew a standard keyword search wouldn't cut it. They needed semantic understanding — the ability to find answers even when the question and the source document used completely different words.

This is the story of how I designed, built, and deployed that system using Retrieval-Augmented Generation (RAG).

Why RAG Was the Only Real Option

The client initially asked for a fine-tuned model. After a discovery call, I identified three dealbreakers with fine-tuning:

1. Their documents update weekly. Fine-tuning requires retraining. That's expensive and slow.

2. They need source attribution. When the system says "the max exposure limit is $2M," the compliance team needs to see exactly which document that came from.

3. Budget constraints. Fine-tuning GPT-4 on 10K documents would cost thousands per training run.

RAG solved all three. The knowledge stays external, sources are always traceable, and the system only calls the LLM at query time — not during ingestion.

The Architecture

Here's the pipeline I settled on after two weeks of prototyping:

Data Ingestion Layer:

A Python service watches designated folders and S3 buckets for new PDFs, DOCX files, and internal wiki exports.
Each file is parsed using a combination of pdfplumber (for structured PDFs with tables) and unstructured (for messy scanned documents with OCR fallback).
Raw text is cleaned: headers/footers stripped, page numbers removed, tables converted to markdown format.

Chunking Strategy:

I tested three approaches: fixed-size (500 tokens), sentence-based, and semantic chunking.
Winner: Semantic chunking using paragraph boundaries with a 400-token target and 80-token overlap. This preserved context far better than arbitrary splits.
Each chunk carries metadata: source document name, page number, section heading, and ingestion timestamp.

Embedding Pipeline:

Model: text-embedding-3-large from OpenAI (3072 dimensions). I tested all-MiniLM-L6-v2 first but the accuracy drop on financial jargon was measurable — 8% worse on our evaluation set.
Embeddings are generated in batches of 100 chunks to stay within rate limits.
Total embedding cost: roughly $12/month for 10K documents with weekly re-ingestion.

Vector Storage:

Pinecone on the Standard plan. I chose Pinecone over pgvector here because the client needed sub-200ms query latency at scale, and Pinecone's managed infrastructure handles that without any tuning.
Each vector is stored with its metadata for filtered retrieval (e.g., "only search documents from the compliance department").

Retrieval + Generation:

At query time, the user's question is embedded and the top 5 chunks are retrieved via cosine similarity.
A reranker (Cohere Rerank) rescores the chunks to push the most relevant ones to the top. This step alone improved answer accuracy by 12%.
The final prompt is assembled: system instructions + retrieved chunks + user question. Sent to gpt-4-turbo with temperature 0.1 for factual grounding.

The Challenges I Didn't Expect

1. Table extraction was brutal. Financial documents are full of tables. When I chunked them as plain text, the LLM couldn't correlate column headers with cell values. Fix: I converted tables to markdown format during preprocessing, which preserved structure beautifully.

2. Duplicate content across documents. Many documents contained identical boilerplate sections. This polluted retrieval results. Fix: I added a deduplication step using MinHash to detect near-duplicate chunks and keep only the most recent version.

3. Long documents with shifting topics. A single 80-page PDF might cover ten different topics. If a chunk from page 3 (about risk management) was adjacent to a chunk from page 4 (about HR policies), the overlap window would blend them. Fix: I added section-aware splitting that respects heading boundaries.

The Results

After 6 weeks of development and 2 weeks of user testing:

98% factual accuracy on a 200-question evaluation set (manually verified by the compliance team)
Query latency under 400ms end-to-end (embedding → retrieval → generation)
70% reduction in analyst research time — what used to take 30 minutes of manual document searching now takes one question
Source attribution on every answer — the system returns the exact document name, page number, and relevant passage

Key Takeaways

1. Chunking strategy matters more than model choice. Switching from fixed-size to semantic chunking improved accuracy by 15% — more than any model upgrade.

2. Always add a reranker. The initial retrieval gets you in the right neighborhood. The reranker gets you to the right house.

3. Metadata is your superpower. Storing section headings, document types, and dates with each chunk enables powerful filtered retrieval.

4. Test with real users early. My evaluation set missed several question patterns that actual analysts asked. User testing caught issues I never would have found in isolation.

Want a System Like This?

If you're sitting on a mountain of documents and your team is still Ctrl+F-ing through PDFs, let's talk. I build production RAG systems that turn your existing knowledge into an intelligent, searchable assistant.

[Try my live RAG demo →](/demo) or [get in touch →](/contact) to discuss your project.

Guide