The 60-Second Version

Large Language Models like GPT-4 and Claude are impressive, but they have a fundamental limitation: they only know what they were trained on. They can't access your company's policies, your product documentation, or last quarter's sales data. Unless you give them a way to look it up.

That's exactly what RAG does. Think of it as giving the LLM a research assistant. Before the model answers your question, RAG searches your documents, finds the most relevant passages, and hands them to the model as context. The model then generates a response grounded in your data, not just its training knowledge.

Figure 1: The core RAG flow: Retrieve → Augment → Generate. The user's question goes to a retriever, which searches and ranks your documents; the question plus the most relevant excerpts form an augmented prompt, and the LLM generates a grounded answer from it.

The name itself tells you the recipe: Retrieval (find relevant info), Augmented (enrich the prompt with it), Generation (let the LLM produce the answer). It's elegantly simple in concept. The complexity lies in doing each step well.

When You Don't Need RAG

Before going deeper, let's get this out of the way. RAG isn't always the answer. Understanding when not to use it is just as important as understanding how it works.

- General knowledge questions: no RAG needed; the LLM already knows this.
- Summarising a single short document: no RAG needed; pass the document directly in the prompt.
- Answering from company-specific docs: RAG needed; the LLM doesn't have this data.
- Querying structured data (SQL tables): maybe; text-to-SQL may be more appropriate.
- Searching across thousands of documents: RAG needed; this is where it excels.
- Tasks needing real-time data (e.g. stock prices): partially; combine API integrations with RAG.
Key Takeaway

If your document fits within the LLM's context window (most modern models handle 100K+ tokens), you may not need RAG at all. Just pass it directly. RAG shines when you're searching across large volumes of data where the LLM needs help finding the needle in the haystack.

How RAG Actually Works

Let's peel back the layers. The retrieval step (finding the right information) is where the real engineering happens. There are two fundamentally different approaches, and understanding both matters.

Level 1: Keyword Search

The simplest form of retrieval uses traditional keyword matching. Algorithms like TF-IDF and BM25 score documents based on how frequently your search terms appear, weighted by how rare those terms are across the full document set. If you search for "refund policy," it finds documents containing those exact words.

This works, but it's brittle. Search for "return process" and you might miss the document titled "Refund Policy," even though they mean the same thing.
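To make the mechanics concrete, here is a minimal pure-Python sketch of TF-IDF-style keyword scoring (real systems typically use BM25, which adds document-length normalisation and term-frequency saturation). The documents and queries are invented for illustration:

```python
import math
import re
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with a simple TF-IDF sum:
    term frequency in the document, weighted by how rare the term is
    across the whole collection. A sketch, not production BM25."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]
    n = len(docs)

    def idf(term):
        df = sum(1 for toks in tokenized if term in toks)
        return math.log((n + 1) / (df + 1)) + 1  # smoothed IDF

    query_terms = re.findall(r"\w+", query.lower())
    return [sum(Counter(toks)[t] * idf(t) for t in query_terms)
            for toks in tokenized]

docs = [
    "Refund Policy: refunds are issued within 14 days of purchase.",
    "Shipping rates depend on the destination country.",
]
print(tfidf_scores("refund policy", docs))   # first document scores highest
print(tfidf_scores("return process", docs))  # both zero: no keyword overlap
```

Note the failure mode in the last line: "return process" shares no tokens with the refund document, so keyword scoring finds nothing at all.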

Level 2: Semantic Search

This is where things get interesting. Instead of matching keywords, semantic search understands meaning. It converts your question and all your documents into numerical representations called embeddings, essentially translating language into coordinates in a high-dimensional space.

Figure 2: Embedding space. Semantically similar concepts cluster together regardless of exact wording: "refund policy", "return process", and "money back" land near each other, well away from clusters like "shipping rates"/"delivery times" and "account setup"/"password reset".

The key insight is that embedding models capture meaning, not just words. "How do I get my money back?" and "What is the refund policy?" land in nearly the same spot in embedding space, so the retriever finds the right document even when the user's phrasing doesn't match the document's language.
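Cosine similarity is the standard way to compare embeddings. A rough illustration, with hand-picked toy vectors standing in for real model output (actual embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings", illustrative only:
refund_policy  = [0.9, 0.1, 0.0]
money_back     = [0.8, 0.2, 0.1]
shipping_rates = [0.1, 0.9, 0.2]

print(cosine_similarity(refund_policy, money_back))      # close to 1
print(cosine_similarity(refund_policy, shipping_rates))  # much lower
```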

Level 3: Vector Databases

Once you've converted your documents into embeddings, you need somewhere to store and search them efficiently. That's what vector databases like ChromaDB, Pinecone, Weaviate, and pgvector (for Postgres users) do. They're purpose-built for finding the nearest neighbours in high-dimensional space, and they do it fast.

Think of it as a specialised index. Instead of searching by keywords like a traditional database, you're searching by mathematical similarity. The query "explain our returns process" gets converted to a vector, and the database returns the documents whose vectors are closest to it.
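Under the hood, the core operation is nearest-neighbour search. This brute-force sketch shows the idea; real vector databases use approximate indexes such as HNSW to stay fast at millions of vectors, but expose a similar add/query interface. All names and data here are illustrative:

```python
import math

class ToyVectorIndex:
    """Brute-force nearest-neighbour search over stored vectors."""

    def __init__(self):
        self._items = []  # (doc_id, vector, metadata)

    def add(self, doc_id, vector, metadata=None):
        self._items.append((doc_id, vector, metadata or {}))

    def query(self, vector, top_k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = lambda v: math.sqrt(sum(x * x for x in v))
            return dot / (norm(a) * norm(b))

        # Rank every stored item by similarity to the query vector.
        ranked = sorted(self._items, key=lambda item: cos(vector, item[1]),
                        reverse=True)
        return [(doc_id, meta) for doc_id, _, meta in ranked[:top_k]]

index = ToyVectorIndex()
index.add("refunds", [0.9, 0.1], {"source": "policy.pdf"})
index.add("shipping", [0.1, 0.9], {"source": "faq.md"})
print(index.query([0.8, 0.2], top_k=1))  # [('refunds', {'source': 'policy.pdf'})]
```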

Decision Point

Choosing the right vector database depends on scale and infrastructure. For prototypes and small datasets, ChromaDB (open-source, runs locally) is great. For production at scale, managed services like Pinecone or extending your existing Postgres with pgvector are more practical. If you're already on Databricks, Vector Search integrates natively with your lakehouse.

Level 4: Document Chunking

Here's something that trips up many teams on their first RAG implementation: you can't just throw entire documents into a vector database. A 50-page policy document needs to be broken into smaller pieces (chunks) so the retriever can return the specific relevant section, not the whole document.

Chunking strategy has an outsized impact on answer quality. Too large, and the retrieved context contains too much noise. Too small, and you lose critical surrounding context.

Figure 3: Document chunking with overlap ensures no context is lost at boundaries. Common strategies include fixed-size chunks (e.g. 512 tokens), sentence-based splitting, semantic chunking (splitting at topic shifts), and recursive/hierarchical chunking; whichever you choose, overlap between adjacent chunks preserves context at the edges.
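A minimal sketch of fixed-size chunking with overlap, operating on a pre-tokenised list (the tiny sizes below are just to make the overlap visible):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks; consecutive chunks
    share `overlap` tokens so boundary context appears in both."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks

print(chunk_tokens(list(range(10)), chunk_size=4, overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```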

Building the Complete Pipeline

With the individual pieces understood, here's how they come together in a production RAG system. There are two distinct phases:

Phase 1: Indexing (done once, updated periodically)

1. Ingest. Load documents from your sources: SharePoint, S3, databases, PDFs, Confluence, and more.

2. Chunk. Split documents into appropriately sized segments with overlap. Preserve metadata (source, date, section headers).

3. Embed. Convert each chunk into a vector using an embedding model (e.g. OpenAI's text-embedding-3-small, or open-source alternatives like BGE or E5).

4. Store. Save the vectors and associated metadata in your vector database, indexed for fast similarity search.

Phase 2: Query (happens at runtime, per user question)

1. Embed the query. Convert the user's question into a vector using the same embedding model.

2. Retrieve. Search the vector database for the top-k most similar chunks. Optionally combine with keyword search (hybrid retrieval).

3. Augment. Construct a prompt that includes the user's question plus the retrieved chunks as context.

4. Generate. Send the augmented prompt to the LLM. The model answers using the provided context, reducing hallucination.
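The Augment step is mostly string assembly. Here is one way to build the prompt; the template wording and chunk fields are assumptions for illustration, not a fixed standard:

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Assemble the prompt sent to the LLM: instructions, retrieved
    context (with sources), then the user's question."""
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in retrieved_chunks
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say you don't have enough information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

chunks = [{"source": "policy.pdf", "text": "Refunds are issued within 14 days."}]
print(build_augmented_prompt("What's our refund policy?", chunks))
```

Keeping the source tag next to each chunk also lets the model cite where its answer came from.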

What Most Teams Get Wrong

After working across multiple RAG implementations in enterprise environments, these are the pitfalls we see most often:

Pitfall 1: Treating RAG as a One-Off Build

Your documents change. New policies get published, old ones get retired. Without an automated re-indexing pipeline, your RAG system silently goes stale. The answers look confident but cite outdated information, which is arguably worse than no answer at all.

Pitfall 2: Ignoring Chunk Quality

The default chunking strategy in most frameworks is "fixed-size, 512 tokens." This works for demos but falls apart with real documents. Tables get split in half. Context from a heading gets separated from its content. Invest time in chunking. It has the single biggest impact on retrieval quality.

Pitfall 3: No Evaluation Framework

How do you know your RAG system is actually returning the right answers? Without a systematic way to measure retrieval quality and answer accuracy (using metrics like recall@k, MRR, and faithfulness), you're flying blind. Build evaluation into your pipeline from day one.

Pitfall 4: Skipping Hybrid Search

Pure semantic search misses exact matches (like product codes or policy numbers), while pure keyword search misses meaning. The best production systems combine both: retrieving candidates through semantic similarity and keyword matching, then re-ranking the combined results.
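One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only each document's rank, not the incomparable raw scores from the two retrievers. Document IDs below are invented:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists (e.g. semantic + keyword) by
    summing 1/(k + rank) per document. k=60 is a conventional default."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_refunds", "doc_shipping", "doc_accounts"]
keyword  = ["doc_sku_4411", "doc_refunds"]
print(reciprocal_rank_fusion([semantic, keyword]))
# ['doc_refunds', 'doc_sku_4411', 'doc_shipping', 'doc_accounts']
```

Documents that appear in both lists (like the refunds doc here) float to the top, while an exact-match hit from keyword search still makes the cut.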

From Prototype to Production

A working demo is maybe 20% of the effort. The remaining 80% is what separates a proof-of-concept from a system your organisation can actually rely on:

Caching. Identical or near-identical questions hit your embedding model and vector database repeatedly. A semantic cache (matching similar queries, not just exact duplicates) reduces latency and cost significantly.
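A sketch of the semantic-cache idea: store (query embedding, answer) pairs and serve a cached answer when a new query's embedding is close enough. The similarity threshold is a tuning knob, and the toy vectors stand in for real embeddings:

```python
import math

class SemanticCache:
    """Serve a cached answer when a new query's embedding is within
    a cosine-similarity threshold of a previously answered one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # (embedding, answer)

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return dot / (norm(a) * norm(b))

    def get(self, query_embedding):
        for emb, answer in self._entries:
            if self._cos(query_embedding, emb) >= self.threshold:
                return answer
        return None  # cache miss: run the full RAG pipeline instead

    def put(self, query_embedding, answer):
        self._entries.append((query_embedding, answer))

cache = SemanticCache(threshold=0.95)
cache.put([0.9, 0.1], "Refunds are issued within 14 days.")
print(cache.get([0.89, 0.11]))  # near-identical query: cache hit
print(cache.get([0.1, 0.9]))    # unrelated query: None
```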

Monitoring. Track retrieval quality over time. Are users getting the right chunks? Are certain question patterns consistently returning poor results? Log every query, the retrieved context, and the generated answer.

Guardrails. The LLM should only answer from the retrieved context. Instruct it to say "I don't have enough information" rather than hallucinate. This requires careful prompt engineering and, in some cases, a secondary verification step.

Access control. In enterprise settings, not every user should be able to retrieve every document. Your RAG system needs to respect existing document permissions, which means filtering at retrieval time based on the user's role.
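In practice this usually means a metadata filter applied to retrieved chunks (or, better, pushed down into the vector database query itself). The `allowed_roles` field below is an assumed metadata convention, not a standard:

```python
def filter_by_permission(chunks, user_roles):
    """Drop retrieved chunks the user is not allowed to see.
    Assumes each chunk carries an 'allowed_roles' metadata field."""
    return [c for c in chunks if set(c["allowed_roles"]) & set(user_roles)]

retrieved = [
    {"text": "Public refund policy.", "allowed_roles": ["staff", "public"]},
    {"text": "Executive salary bands.", "allowed_roles": ["hr"]},
]
print(filter_by_permission(retrieved, user_roles=["staff"]))
# Only the public chunk survives; the HR-only chunk is dropped.
```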

Error handling and resilience. Embedding API timeouts, vector database connection failures, malformed documents. Production RAG systems need retry logic, circuit breakers, and graceful degradation strategies. These aren't glamorous, but they're what keep the system running at 3am.


Where to Start

If you're evaluating RAG for your organisation, here's a practical starting point: pick a single, well-scoped use case. Internal knowledge search, customer support documentation, or compliance Q&A are common first projects. Build a prototype with a small document set, measure the quality of answers honestly, and iterate on your chunking and retrieval strategy before scaling up.

The technology is mature enough for production today, but the gap between a demo and a reliable enterprise system is real. Getting the fundamentals right (chunking, retrieval strategy, evaluation, and monitoring) matters far more than chasing the latest framework.

OZ Data Solutions

Senior data and AI consulting, from strategy through to production. Melbourne, Australia.