Getting Started
Retrieval-Augmented Generation (RAG) is the pattern that lets agents answer questions about your specific data — documents, codebases, knowledge bases — rather than relying solely on what the LLM learned during training. RAG is one of the most practical and widely deployed patterns in production AI systems.
The core idea is straightforward: before generating a response, search a knowledge base for relevant information and include it in the prompt. This grounds the LLM's output in real data, reducing hallucination and enabling answers about information the model has never seen.
A RAG pipeline has two phases, indexing and inference:
Documents -> Chunk -> Embed -> Store (indexing)
Query -> Embed -> Search -> Retrieve -> Generate (inference)
Key Concepts
Document chunking is the first critical decision. You need to split documents into pieces small enough to be relevant but large enough to contain meaningful context:
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Split text into overlapping chunks with metadata."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk_text = " ".join(words[i:i + chunk_size])
        chunks.append({
            "text": chunk_text,
            "start_index": i,
            "word_count": len(chunk_text.split())
        })
    return chunks
Overlapping chunks ensure that information near chunk boundaries is not lost. In practice, you will want smarter splitting that respects paragraph and section boundaries rather than splitting mid-sentence.
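One way to sketch such structure-aware splitting is to accumulate whole paragraphs under a word budget, carrying the last paragraph forward as overlap. The function name and parameters here are illustrative, not from any particular library:

```python
def chunk_by_paragraph(text: str, max_words: int = 500, overlap_paras: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_words, overlapping by whole paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        # Flush the current chunk when the next paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paras:]  # carry trailing paragraph(s) as overlap
            count = sum(len(p.split()) for p in current)
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks never split inside a paragraph, each one stays coherent; the trade-off is that chunk sizes are only approximate.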
Embedding models convert text into dense vectors that capture semantic meaning. Popular choices include OpenAI's text-embedding-3-small, Cohere's embed-v3, and open-source models like all-MiniLM-L6-v2 from Sentence Transformers. The choice of embedding model affects both retrieval quality and cost:
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
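Once texts are embedded, retrieval reduces to nearest-neighbor search over the vectors. A minimal plain-Python sketch of cosine-similarity search (a vector database does this far more efficiently at scale; these helper names are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine_similarity(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]
```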
Hybrid search combines semantic search (vector similarity) with keyword search (BM25 or similar) for better results. Semantic search excels at finding conceptually related content, while keyword search catches exact matches that vector search might miss. Many production systems use both and merge the results:
def hybrid_search(query: str, collection, bm25_index, k: int = 5):
    # Semantic search
    semantic_results = collection.query(query_texts=[query], n_results=k)
    # Keyword search
    keyword_results = bm25_index.search(query, top_k=k)
    # Merge and deduplicate, weighting semantic higher
    return merge_results(semantic_results, keyword_results, semantic_weight=0.7)
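The merge_results helper is left undefined above. One possible sketch, assuming each result list is a sequence of (doc_id, score) pairs (real vector stores and BM25 libraries return richer structures you would first flatten into this shape), using min-max normalization and a weighted sum:

```python
def merge_results(semantic, keyword, semantic_weight: float = 0.7, k: int = 5) -> list:
    """Merge two ranked (doc_id, score) lists into one deduplicated ranking."""
    def normalize(results):
        # Min-max normalize scores to [0, 1] so the two scales are comparable.
        if not results:
            return {}
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in results}

    sem, kw = normalize(semantic), normalize(keyword)
    combined = {}
    for doc in set(sem) | set(kw):
        combined[doc] = (semantic_weight * sem.get(doc, 0.0)
                         + (1 - semantic_weight) * kw.get(doc, 0.0))
    return sorted(combined, key=combined.get, reverse=True)[:k]
```

Reciprocal rank fusion is a common alternative that merges by rank position instead of raw scores, avoiding the normalization step entirely.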
Re-ranking is a second-pass scoring that improves precision. After retrieving candidate chunks, a cross-encoder model scores each chunk against the query more carefully than the initial embedding similarity. This significantly improves the quality of the top results.
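The re-ranking step itself is simple once you have a scorer. A minimal sketch with a pluggable scoring function; in production the scorer would be a real cross-encoder (for example, a Sentence Transformers CrossEncoder), and the word-overlap scorer below is only a toy stand-in so the example runs on its own:

```python
def rerank(query: str, chunks: list[str], score_fn, top_n: int = 3) -> list[str]:
    """Re-score retrieved chunks against the query and keep the best ones."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def word_overlap(query: str, chunk: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words present in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0
```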
Hands-On Practice
Start by building a RAG system over a manageable collection — a folder of markdown files or a set of PDFs from a single domain. Use ChromaDB for storage and OpenAI embeddings for simplicity.
Once the basic pipeline works, focus on evaluation. Create a set of 10-20 test questions with known answers and measure how often the correct chunk appears in the top results. This gives you a baseline to improve against as you experiment with chunk sizes, overlap ratios, embedding models, and re-ranking. Production RAG is an iterative process of measurement and refinement.
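A small recall@k harness for this measurement might look like the following sketch; the (question, correct_chunk_id) test-case shape and the retrieve callable are assumptions about how you have wired up your pipeline:

```python
def recall_at_k(test_cases: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Fraction of questions whose known-correct chunk id appears in the top-k results.

    `retrieve` takes a question and returns a ranked list of chunk ids.
    """
    hits = sum(1 for question, correct in test_cases
               if correct in retrieve(question)[:k])
    return hits / len(test_cases)
```

Re-run this after every change to chunking, embeddings, or re-ranking; a single number per configuration makes it easy to tell whether a tweak actually helped.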