Phase 3: Production Patterns · Step 7 of 14 · Intermediate · 2-3 weeks

RAG Systems

Ground agents in your data with retrieval-augmented generation

Document chunking · Embedding models · Vector stores · Hybrid search · Re-ranking · Query refinement

Getting Started

Retrieval-Augmented Generation is the pattern that lets agents answer questions about your specific data — documents, codebases, knowledge bases — rather than relying solely on what the LLM learned during training. RAG is one of the most practical and widely deployed patterns in production AI systems.

The core idea is straightforward: before generating a response, search a knowledge base for relevant information and include it in the prompt. This grounds the LLM's output in real data, reducing hallucination and enabling answers about information the model has never seen.
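The "include it in the prompt" step can be sketched in a few lines. This is a minimal illustration, not a prescribed template; `build_rag_prompt` is a hypothetical helper, and the exact instructions you put in the prompt will vary by use case:

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved context chunks."""
    # Number the chunks so the model (and you) can trace answers to sources.
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is what you send to the LLM in place of the bare question.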

A RAG pipeline has two phases, an offline indexing phase and an online inference phase:

Documents -> Chunk -> Embed -> Store (indexing)
Query -> Embed -> Search -> Retrieve -> Generate (inference)

Key Concepts

Document chunking is the first critical decision. You need to split documents into pieces small enough to be relevant but large enough to contain meaningful context:

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Split text into overlapping word-based chunks with metadata."""
    if overlap >= chunk_size:
        # Otherwise the step below would be zero or negative and the loop
        # would never advance.
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk_text = " ".join(words[i:i + chunk_size])
        chunks.append({
            "text": chunk_text,
            "start_index": i,
            "word_count": len(chunk_text.split())
        })
    return chunks

Overlapping chunks ensure that information near chunk boundaries is not lost. In practice, you will want smarter splitting that respects paragraph and section boundaries rather than splitting mid-sentence.
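One way to respect those boundaries is to pack whole paragraphs into chunks instead of splitting on raw word counts. This is a minimal sketch assuming paragraphs are separated by blank lines; `chunk_by_paragraphs` is a hypothetical helper, and a paragraph longer than the budget simply becomes its own oversized chunk:

```python
def chunk_by_paragraphs(text: str, max_words: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Libraries like LangChain ship recursive splitters that take this idea further, falling back from paragraphs to sentences to words as needed.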

Embedding models convert text into dense vectors that capture semantic meaning. Popular choices include OpenAI's text-embedding-3-small, Cohere's embed-v3, and open-source models like all-MiniLM-L6-v2 from Sentence Transformers. The choice of embedding model affects both retrieval quality and cost:

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

Hybrid search combines semantic search (vector similarity) with keyword search (BM25 or similar) for better results. Semantic search excels at finding conceptually related content, while keyword search catches exact matches that vector search might miss. Many production systems use both and merge the results:

def hybrid_search(query: str, collection, bm25_index, k: int = 5):
    # Semantic search
    semantic_results = collection.query(query_texts=[query], n_results=k)

    # Keyword search
    keyword_results = bm25_index.search(query, top_k=k)

    # Merge and deduplicate, weighting semantic higher
    return merge_results(semantic_results, keyword_results, semantic_weight=0.7)
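The `merge_results` helper above is left undefined. One possible implementation, assuming both searches have been normalized into lists of `(doc_id, score)` pairs with scores in [0, 1], is a weighted score merge:

```python
def merge_results(semantic, keyword, semantic_weight=0.7, k=5):
    """Weighted merge of two ranked result lists of (doc_id, score) pairs.

    Assumes scores in each list are normalized to [0, 1]; documents found
    by both searches accumulate a blended score.
    """
    kw_weight = 1.0 - semantic_weight
    combined: dict[str, float] = {}
    for doc_id, score in semantic:
        combined[doc_id] = combined.get(doc_id, 0.0) + semantic_weight * score
    for doc_id, score in keyword:
        combined[doc_id] = combined.get(doc_id, 0.0) + kw_weight * score
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

An alternative that avoids score normalization entirely is reciprocal rank fusion, which combines results by rank position rather than raw score.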

Re-ranking is a second-pass scoring that improves precision. After retrieving candidate chunks, a cross-encoder model scores each chunk against the query more carefully than the initial embedding similarity. This significantly improves the quality of the top results.
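The two-pass pattern itself is simple: retrieve a generous candidate set, then re-score and keep the best few. The sketch below uses a stand-in lexical scorer so it runs anywhere; in production you would pass a real cross-encoder scorer (for example, one from the sentence-transformers library) as `score_fn`. `rerank` is a hypothetical helper:

```python
def rerank(query: str, candidates: list[str], top_n: int = 3,
           score_fn=None) -> list[str]:
    """Second-pass re-ranking: score each candidate against the query
    and keep the best top_n. score_fn stands in for a cross-encoder."""
    if score_fn is None:
        # Stand-in scorer: fraction of query words present in the candidate.
        def score_fn(q, c):
            q_words = set(q.lower().split())
            c_words = set(c.lower().split())
            return len(q_words & c_words) / max(len(q_words), 1)
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]
```

A common setup is to retrieve 20-50 candidates with vector search, then re-rank down to the 3-5 chunks that actually enter the prompt.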

Hands-On Practice

Start by building a RAG system over a manageable collection — a folder of markdown files or a set of PDFs from a single domain. Use ChromaDB for storage and OpenAI embeddings for simplicity.

Once the basic pipeline works, focus on evaluation. Create a set of 10-20 test questions with known answers and measure how often the correct chunk appears in the top results. This gives you a baseline to improve against as you experiment with chunk sizes, overlap ratios, embedding models, and re-ranking. Production RAG is an iterative process of measurement and refinement.
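The core retrieval metric here, often called recall@k, can be computed with a few lines. This is a minimal sketch: `recall_at_k` is a hypothetical helper, and `retrieve` stands in for whatever function maps a question to a ranked list of chunk ids in your pipeline:

```python
def recall_at_k(test_cases, retrieve, k: int = 5) -> float:
    """Fraction of test questions whose known-relevant chunk id
    appears in the top-k retrieved results.

    test_cases: list of (question, relevant_chunk_id) pairs
    retrieve:   function mapping a question to a ranked list of chunk ids
    """
    hits = sum(
        1 for question, relevant_id in test_cases
        if relevant_id in retrieve(question)[:k]
    )
    return hits / len(test_cases)
```

Re-run this after each change to chunking, embeddings, or re-ranking; if the number does not move, the change was not worth keeping.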

Exercises

Build RAG over 100+ Documents

Create a complete RAG pipeline that ingests a collection of 100+ documents (PDFs, text files, or markdown), chunks them, generates embeddings, stores them in a vector database, and answers questions using retrieved context. Include evaluation metrics to measure retrieval quality.

Knowledge Check

Why is chunking strategy important in a RAG pipeline?

Milestone Project

RAG system over 100+ documents with evaluation metrics (precision, recall, answer quality)