Getting Started
Agents without memory are stateless — every conversation starts from zero. This limits their usefulness for any task that spans multiple interactions. Memory systems give agents the ability to recall past conversations, learn user preferences, and build knowledge over time.
The simplest form of memory is conversation history: appending every message to the messages array. This works for short conversations but hits context window limits quickly. A conversation with 50 exchanges can easily consume 20,000+ tokens, leaving little room for the actual task.
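To make that concrete, here is a rough sketch of how fast a naive message log grows, using the common ~4 characters per token heuristic (a real system would use the model's actual tokenizer, e.g. tiktoken); the message sizes are invented for illustration:

```python
def estimate_tokens(messages):
    """Approximate token count using the ~4 chars/token heuristic."""
    chars = sum(len(m["content"]) for m in messages)
    return chars // 4

# Simulate 50 exchanges: a short user message and a longer reply each turn
history = []
for i in range(50):
    history.append({"role": "user", "content": "x" * 300})
    history.append({"role": "assistant", "content": "x" * 900})

print(estimate_tokens(history))  # 60,000 chars -> ~15,000 tokens
```

Even with modest message sizes, the log alone approaches the budget many tasks need for instructions, tools, and retrieved context.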
The solution is to layer multiple memory types, each serving a different purpose:
```python
class AgentMemory:
    def __init__(self):
        self.short_term = []            # Current conversation messages
        self.long_term = VectorStore()  # Persistent fact storage
        self.summary = ""               # Running conversation summary

    def add_message(self, role: str, content: str):
        self.short_term.append({"role": role, "content": content})
        # If conversation is getting long, summarize older messages
        if len(self.short_term) > 20:
            self._compress()

    def _compress(self):
        # Fold the 10 oldest messages into the running summary, then drop them
        old_messages = self.short_term[:10]
        self.summary = summarize(self.summary, old_messages)
        self.short_term = self.short_term[10:]
```
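`VectorStore` and `summarize` are left abstract above, so the demo below supplies trivial stand-ins (a real `summarize` would prompt an LLM to merge old messages into the summary). The class body is repeated so the snippet runs standalone:

```python
class VectorStore:
    """Placeholder; a real agent would use a vector database here."""

def summarize(existing_summary, messages):
    # Stand-in: a real implementation would ask an LLM to merge
    # the old messages into the running summary text.
    return f"{existing_summary} [+{len(messages)} messages]".strip()

class AgentMemory:  # same class as above, repeated to run standalone
    def __init__(self):
        self.short_term = []
        self.long_term = VectorStore()
        self.summary = ""

    def add_message(self, role, content):
        self.short_term.append({"role": role, "content": content})
        if len(self.short_term) > 20:
            self._compress()

    def _compress(self):
        old_messages = self.short_term[:10]
        self.summary = summarize(self.summary, old_messages)
        self.short_term = self.short_term[10:]

memory = AgentMemory()
for i in range(25):
    memory.add_message("user", f"message {i}")

print(len(memory.short_term))  # 15: compression fired once, at message 21
print(memory.summary)          # "[+10 messages]"
```

Note that compression fires on the 21st message, shrinking short-term memory back to 11 entries before the remaining messages accumulate.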
Key Concepts
Vector embeddings are the foundation of semantic memory. An embedding model converts text into a numerical vector that captures its meaning. Similar texts produce similar vectors, enabling search by meaning rather than exact keyword match:
```python
import chromadb
from chromadb.utils import embedding_functions

ef = embedding_functions.DefaultEmbeddingFunction()
client = chromadb.PersistentClient(path="./memory_db")
collection = client.get_or_create_collection("agent_memory", embedding_function=ef)

# Store a memory
collection.add(
    documents=["User prefers concise answers under 200 words"],
    metadatas=[{"type": "preference", "timestamp": "2026-01-15"}],
    ids=["mem_001"],
)

# Retrieve relevant memories
results = collection.query(
    query_texts=["How should I format my response?"],
    n_results=3,
)
```
Context window management is the practical challenge that drives memory system design. Current models have context windows ranging from 128K to 200K tokens, but performance often degrades with very long contexts. Effective strategies include:
- Sliding window: Keep only the most recent N messages in full.
- Summarization: Periodically compress older messages into a summary.
- Selective retrieval: Only inject memories that are relevant to the current query.
- Hierarchical memory: Store detailed memories locally, summaries in the main context.
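The selective-retrieval strategy can be sketched without a vector database: score each stored memory against the query and inject only those above a relevance threshold. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the threshold value is arbitrary:

```python
import math

def embed(text):
    # Toy stand-in for an embedding model: word-count vectors.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def relevant_memories(query, memories, threshold=0.2):
    """Return only memories similar enough to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(m)), m) for m in memories]
    return [m for score, m in sorted(scored, reverse=True) if score >= threshold]

memories = [
    "User prefers concise answers",
    "User is allergic to peanuts",
    "User works in finance",
]
print(relevant_memories("concise answers please", memories))
# ['User prefers concise answers'] -- the unrelated facts stay out of the prompt
```

Swapping `embed` for a real embedding model (as in the ChromaDB example above) gives semantic rather than lexical matching, but the injection logic stays the same.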
Memory types map to different agent capabilities. Short-term memory (the current conversation) handles immediate context. Long-term memory (vector store) enables recall across sessions. Episodic memory (specific past interactions) helps the agent reference previous conversations. Semantic memory (distilled facts and preferences) captures what the agent has learned about the user and domain.
Hands-On Practice
Start with ChromaDB for local vector storage — it requires no infrastructure and persists to disk. Build a simple memory layer that extracts key facts from each conversation turn and stores them. Then test recall by starting a new conversation and asking about something mentioned in a previous session.
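A skeleton for that exercise might look like the following. `extract_facts` is a hypothetical hook where you would prompt an LLM; it is stubbed here with a trivial rule so the snippet runs, and the in-memory list stands in for a ChromaDB collection:

```python
class MemoryLayer:
    def __init__(self):
        self.facts = []  # swap for a ChromaDB collection in practice

    def extract_facts(self, role, content):
        # Hypothetical extraction hook: a real agent would prompt an LLM
        # with something like "list any durable facts or preferences in
        # this message". Trivial stand-in rule: treat first-person user
        # statements as facts worth remembering.
        if role == "user" and content.startswith("I "):
            return [content]
        return []

    def observe(self, role, content):
        self.facts.extend(self.extract_facts(role, content))

layer = MemoryLayer()
layer.observe("user", "I prefer answers in bullet points")
layer.observe("assistant", "Sure, here is a summary...")
layer.observe("user", "What's the weather like?")
print(layer.facts)  # ['I prefer answers in bullet points']
```

Once storage is wired to a persistent backend, the recall test in the paragraph above becomes a matter of querying `facts` from a fresh session.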
The goal is not perfect recall but relevant recall. An agent that surfaces the right context at the right time is far more useful than one that dumps its entire memory into every prompt.