The 60-Second Version
Large Language Models like GPT-4 and Claude are impressive, but they have a fundamental limitation: they only know what they were trained on. They can't access your company's policies, your product documentation, or last quarter's sales data. Unless you give them a way to look it up.
That's exactly what RAG does. Think of it as giving the LLM a research assistant. Before the model answers your question, RAG searches your documents, finds the most relevant passages, and hands them to the model as context. The model then generates a response grounded in your data, not just its training knowledge.
The name itself tells you the recipe: Retrieval (find relevant info), Augmented (enrich the prompt with it), Generation (let the LLM produce the answer). It's elegantly simple in concept. The complexity lies in doing each step well.
When You Don't Need RAG
Before going deeper, let's get this out of the way. RAG isn't always the answer. Understanding when not to use it is just as important as understanding how it works.
| Scenario | RAG Needed? | Better Approach |
|---|---|---|
| General knowledge questions | No | The LLM already knows this |
| Summarising a single short document | No | Pass the document directly in the prompt |
| Answering from company-specific docs | Yes | RAG: the LLM doesn't have this data |
| Querying structured data (SQL tables) | Maybe | Text-to-SQL may be more appropriate |
| Searching across thousands of documents | Yes | RAG excels here |
| Tasks needing real-time data (stock prices) | Partially | API integrations + RAG |
If your document fits within the LLM's context window (most modern models handle 100K+ tokens), you may not need RAG at all. Just pass it directly. RAG shines when you're searching across large volumes of data where the LLM needs help finding the needle in the haystack.
How RAG Actually Works
Let's peel back the layers. The retrieval step (finding the right information) is where the real engineering happens. There are two fundamentally different approaches, and understanding both matters.
Level 1: Keyword Search
The simplest form of retrieval uses traditional keyword matching. Algorithms like TF-IDF and BM25 score documents based on how frequently your search terms appear, weighted by how rare those terms are across the full document set. If you search for "refund policy," it finds documents containing those exact words.
This works, but it's brittle. Search for "return process" and you might miss the document titled "Refund Policy," even though they mean the same thing.
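To make that concrete, here's a minimal keyword-scoring sketch using the open-source rank-bm25 package; the corpus, query, and tokenizer are illustrative:

```python
import re

from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "Refund Policy: customers may request a refund within 30 days.",
    "Shipping times vary by region and carrier.",
    "Our return process requires the original receipt.",
]

def tokenize(text: str) -> list[str]:
    # Naive tokenizer: lowercase words only, punctuation stripped.
    return re.findall(r"[a-z0-9]+", text.lower())

bm25 = BM25Okapi([tokenize(doc) for doc in corpus])
scores = bm25.get_scores(tokenize("refund policy"))
print(scores)  # the first document scores highest on these exact terms
```

Search for "return process" instead and the scores shift to the third document, even though the first is equally relevant. That's the brittleness in action.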
Level 2: Semantic Search
This is where things get interesting. Instead of matching keywords, semantic search understands meaning. It converts your question and all your documents into numerical representations called embeddings, essentially translating language into coordinates in a high-dimensional space.
The key insight is that embedding models capture meaning, not just words. "How do I get my money back?" and "What is the refund policy?" land in nearly the same spot in embedding space, so the retriever finds the right document even when the user's phrasing doesn't match the document's language.
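You can see this for yourself with the sentence-transformers library; the model name below is just one common choice, not a recommendation:

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is illustrative; any sentence-embedding model behaves similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode([
    "How do I get my money back?",    # user phrasing
    "What is the refund policy?",     # document phrasing
    "Shipping times vary by region",  # unrelated
])

# Cosine similarity: semantically close sentences score near 1.0.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: same intent
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: different topic
```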
Level 3: Vector Databases
Once you've converted your documents into embeddings, you need somewhere to store and search them efficiently. That's what vector databases like ChromaDB, Pinecone, Weaviate, and pgvector (for Postgres users) do. They're purpose-built for finding the nearest neighbours in high-dimensional space, and they do it fast.
Think of it as a specialised index. Instead of searching by keywords like a traditional database, you're searching by mathematical similarity. The query "explain our returns process" gets converted to a vector, and the database returns the documents whose vectors are closest to it.
Choosing the right vector database depends on scale and infrastructure. For prototypes and small datasets, ChromaDB (open-source, runs locally) is great. For production at scale, managed services like Pinecone or extending your existing Postgres with pgvector are more practical. If you're already on Databricks, Vector Search integrates natively with your lakehouse.
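To give a feel for the developer experience, here's a minimal local sketch with ChromaDB, which embeds documents for you using a built-in default model:

```python
import chromadb  # pip install chromadb

client = chromadb.Client()  # in-memory; use PersistentClient for disk
collection = client.create_collection("policies")

collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Refunds are issued within 30 days of purchase.",
        "Shipping times vary by region and carrier.",
    ],
    metadatas=[{"source": "refund_policy.pdf"}, {"source": "shipping.pdf"}],
)

results = collection.query(query_texts=["explain our returns process"], n_results=1)
print(results["documents"])  # nearest chunk: the refund document
```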
Level 4: Document Chunking
Here's something that trips up many teams on their first RAG implementation: you can't just throw entire documents into a vector database. A 50-page policy document needs to be broken into smaller pieces (chunks) so the retriever can return the specific relevant section, not the whole document.
Chunk size has an outsized impact on answer quality. Too large, and the retrieved context buries the answer in noise; too small, and you lose critical surrounding context.
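For illustration, a bare-bones fixed-size chunker with overlap might look like the sketch below; real pipelines usually split on sentence or document structure instead, and the sizes here are arbitrary:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-count chunks; overlap preserves context across boundaries."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start : start + chunk_size])
        for start in range(0, len(words), step)
    ]
```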
Building the Complete Pipeline
With the individual pieces understood, here's how they come together in a production RAG system. There are two distinct phases:
Phase 1: Indexing (done once, updated periodically)
1. Ingest. Load documents from your sources: SharePoint, S3, databases, PDFs, Confluence, and more.
2. Chunk. Split documents into appropriately sized segments with overlap. Preserve metadata (source, date, section headers).
3. Embed. Convert each chunk into a vector using an embedding model (e.g. OpenAI's text-embedding-3-small or open-source alternatives like BGE or E5).
4. Store. Save the vectors and associated metadata in your vector database, indexed for fast similarity search.
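A compressed sketch of how those four steps might fit together, reusing ChromaDB and the chunker from earlier; load_documents() is a hypothetical stand-in for your own ingestion code:

```python
import chromadb

client = chromadb.PersistentClient(path="./rag_index")  # survives restarts
collection = client.get_or_create_collection("knowledge_base")

# load_documents() is hypothetical: yields (text, metadata) pairs from
# your sources. chunk_text() is the chunker sketched earlier.
for text, metadata in load_documents():
    for i, chunk in enumerate(chunk_text(text)):
        collection.add(
            ids=[f"{metadata['source']}-{i}"],  # stable, unique chunk IDs
            documents=[chunk],                  # ChromaDB embeds these itself
            metadatas=[metadata],
        )
```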
Phase 2: Query (happens at runtime, per user question)
1. Embed the query. Convert the user's question into a vector using the same embedding model.
2. Retrieve. Search the vector database for the top-k most similar chunks. Optionally combine with keyword search (hybrid retrieval).
3. Augment. Construct a prompt that includes the user's question plus the retrieved chunks as context.
4. Generate. Send the augmented prompt to the LLM. The model answers using the provided context, reducing hallucination.
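And the query phase, continuing the same sketch; the prompt wording and the llm.generate() call are placeholders, not any specific library's API:

```python
user_question = "How do I get my money back?"

# Retrieve: 'collection' is the vector store built during indexing.
results = collection.query(query_texts=[user_question], n_results=4)
context = "\n\n".join(results["documents"][0])

# Augment: wording is illustrative and will need tuning per model.
prompt = (
    "Answer the question using ONLY the context below. If the context "
    "is insufficient, say you don't have enough information.\n\n"
    f"Context:\n{context}\n\nQuestion: {user_question}"
)

# Generate: llm.generate() is a placeholder for your LLM client of choice.
answer = llm.generate(prompt)
```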
What Most Teams Get Wrong
Having worked across multiple RAG implementations in enterprise environments, we see the same pitfalls again and again:
Stale indexes. Your documents change. New policies get published, old ones get retired. Without an automated re-indexing pipeline, your RAG system silently goes stale. The answers look confident but cite outdated information, which is arguably worse than no answer at all.
Naive chunking. The default chunking strategy in most frameworks is "fixed-size, 512 tokens." This works for demos but falls apart with real documents: tables get split in half, and content gets separated from the heading that gives it meaning. Invest time in chunking. It has the single biggest impact on retrieval quality.
No evaluation. How do you know your RAG system is actually returning the right answers? Without a systematic way to measure retrieval quality and answer accuracy (using metrics like recall@k, MRR, and faithfulness), you're flying blind. Build evaluation into your pipeline from day one.
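The retrieval half of that evaluation is easy to compute yourself once you have labelled query-to-chunk pairs (faithfulness usually needs an LLM-as-judge step on top). A minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk; 0.0 if none retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```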
Single-mode retrieval. Pure semantic search misses exact matches (like product codes or policy numbers), while pure keyword search misses meaning. The best production systems combine both: retrieving candidates through semantic similarity and keyword matching, then re-ranking the combined results.
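One common way to merge the two candidate lists is reciprocal rank fusion (RRF), sketched below; k=60 is the conventional default, and the document IDs are whatever your two retrievers return:

```python
def rrf_merge(keyword_ids: list[str], semantic_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked ID lists: each doc scores 1/(k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```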
From Prototype to Production
A working demo is maybe 20% of the effort. The remaining 80% is what separates a proof-of-concept from a system your organisation can actually rely on:
Caching. Identical or near-identical questions hit your embedding model and vector database repeatedly. A semantic cache (matching similar queries, not just exact duplicates) reduces latency and cost significantly.
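A semantic cache can be as simple as comparing the new query's embedding against embeddings of past queries; the threshold below is illustrative and needs tuning against your own traffic:

```python
import numpy as np

cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def cached_answer(query_emb: np.ndarray, threshold: float = 0.95) -> str | None:
    """Return a cached answer if any past query embeds close enough."""
    for emb, answer in cache:
        cos = float(np.dot(query_emb, emb)
                    / (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
        if cos >= threshold:
            return answer
    return None  # cache miss: run the full RAG pipeline, then append to cache
```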
Monitoring. Track retrieval quality over time. Are users getting the right chunks? Are certain question patterns consistently returning poor results? Log every query, the retrieved context, and the generated answer.
Guardrails. The LLM should only answer from the retrieved context. Instruct it to say "I don't have enough information" rather than hallucinate. This requires careful prompt engineering and, in some cases, a secondary verification step.
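A secondary verification step might look like the sketch below, reusing the hypothetical llm.generate() client from the query sketch; the prompt wording is illustrative:

```python
def verify_grounding(answer: str, context: str) -> bool:
    """Ask a second model call whether the draft answer is supported by the context."""
    verdict = llm.generate(
        "Does the CONTEXT fully support the ANSWER? Reply YES or NO.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    return verdict.strip().upper().startswith("YES")
```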
Access control. In enterprise settings, not every user should be able to retrieve every document. Your RAG system needs to respect existing document permissions, which means filtering at retrieval time based on the user's role.
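With ChromaDB, for example, one way to enforce this is to tag each chunk with an access group at indexing time and filter with a where clause at query time; the field name and groups here are assumptions for illustration:

```python
# Each chunk was stored with an "allowed_group" entry in its metadata.
# user_groups comes from your identity provider, e.g. ["finance", "all-staff"].
results = collection.query(
    query_texts=[user_question],
    n_results=4,
    where={"allowed_group": {"$in": user_groups}},  # only permitted chunks
)
```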
Error handling and resilience. Embedding API timeouts, vector database connection failures, malformed documents. Production RAG systems need retry logic, circuit breakers, and graceful degradation strategies. These aren't glamorous, but they're what keep the system running at 3am.
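As one example of the retry piece, the tenacity library wraps a flaky call in exponential backoff; embed_query() is a hypothetical wrapper around whatever embedding API you use:

```python
from tenacity import retry, stop_after_attempt, wait_exponential  # pip install tenacity

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def embed_with_retry(text: str) -> list[float]:
    # Retries up to 3 times on any exception, backing off exponentially.
    return embed_query(text)  # hypothetical call to your embedding API
```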
Where to Start
If you're evaluating RAG for your organisation, here's a practical starting point: pick a single, well-scoped use case. Internal knowledge search, customer support documentation, or compliance Q&A are common first projects. Build a prototype with a small document set, measure the quality of answers honestly, and iterate on your chunking and retrieval strategy before scaling up.
The technology is mature enough for production today, but the gap between a demo and a reliable enterprise system is real. Getting the fundamentals right (chunking, retrieval strategy, evaluation, and monitoring) matters far more than chasing the latest framework.