Phase 4: Hardening · Step 12 of 14 · Advanced · 1-2 weeks · Checkpoint

Deployment & Scaling

Ship agents to production with containers, APIs, rate limiting, and cost controls.

Docker · FastAPI · Kubernetes basics · Rate limiting · Caching · Cost optimization

Getting Started

An agent running on your laptop is a prototype. An agent behind an API with rate limiting, caching, and monitoring is a product. The gap between the two is deployment engineering: containerization, API design, resource management, and cost control.

The standard production stack for agent deployment is FastAPI for the HTTP layer, Docker for packaging, and a cloud platform (Railway, Render, or AWS) for hosting. This combination gives you a reproducible deployment that scales.

Key Concepts

Docker containerization ensures your agent runs the same way everywhere. A well-structured Dockerfile installs dependencies, copies your code, and defines the startup command:

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

FastAPI provides the HTTP interface for your agent. Define typed request and response models so clients know exactly what to send and expect:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# `agent` is your existing agent instance, constructed once at startup.

class QueryRequest(BaseModel):
    question: str
    session_id: str | None = None

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    tokens_used: int

@app.post("/query", response_model=QueryResponse)
async def query_agent(request: QueryRequest):
    result = await agent.run(request.question)
    return QueryResponse(
        answer=result.answer,
        sources=result.sources,
        tokens_used=result.token_count,
    )

Rate limiting prevents abuse and controls costs. The slowapi library integrates with FastAPI to limit requests per user based on IP address or API key. Note that slowapi requires the decorated endpoint to accept the raw Starlette Request as a parameter named request, and the rate-limit exception handler must be registered on the app:

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")
async def query_agent(request: Request, query: QueryRequest):
    ...

Caching avoids redundant LLM calls. For identical queries, return the cached response instead of making another API call. Use TTL-based expiry so the cache does not grow unbounded and responses stay fresh.

Cost optimization is critical because LLM calls are expensive. Route simple queries to smaller, cheaper models. Batch similar requests. Cache aggressively. Monitor token usage per endpoint and set budget alerts so you are never surprised by the bill.
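Routing simple queries to a cheaper model can be as simple as a heuristic gate in front of the model call. The sketch below is an assumption-laden illustration: the model names are placeholders for whatever cheap/expensive pair you actually use, and the length-and-keyword heuristic is one possible signal, not a recommended production classifier.

```python
# Hypothetical model identifiers -- substitute your provider's actual names.
CHEAP_MODEL = "small-model"
EXPENSIVE_MODEL = "large-model"

# Keywords that loosely suggest multi-step reasoning is needed.
COMPLEX_MARKERS = ("compare", "analyze", "step by step", "explain why")


def pick_model(question: str) -> str:
    """Route a query to the cheap model unless it looks complex."""
    text = question.lower()
    if len(question) > 500 or any(marker in text for marker in COMPLEX_MARKERS):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL
```

A router like this pays for itself quickly: if most traffic is short factual questions, the expensive model only sees the minority of queries that plausibly need it, and you can tune the markers from your own token-usage logs.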

Hands-On Practice

Take an agent you built in a previous step and wrap it in a FastAPI application. Add a /query endpoint, a /health endpoint, and rate limiting. Write a Dockerfile, build the image, and run it locally. Test it with curl or a simple Python client. Once it works locally, deploy it to Railway or Render and verify it responds to requests from the public internet.

Exercises

Deploy an Agent as a Production API

Package an existing agent into a Docker container with a FastAPI interface. Add rate limiting (10 requests per minute per user), response caching for identical queries, and a health check endpoint. Deploy it to a cloud provider of your choice.

Knowledge Check

Why is rate limiting important for a production agent API?

Milestone Project

Deploy an agent API with monitoring, rate limiting, and autoscaling.