Getting Started
Multi-modal agents process more than text. They see images, read documents, navigate websites, and interact with desktop applications. This capability transforms what agents can automate — instead of being limited to APIs and structured data, they can work with the same interfaces humans use.
The two key technologies are vision-language models (Claude, GPT-4V) that understand images alongside text, and browser/desktop automation tools (Playwright, Claude Computer Use) that give agents the ability to interact with visual interfaces.
Key Concepts
Vision models accept images as input alongside text prompts. You can send a screenshot, a photo, a chart, or a scanned document and ask questions about it. This is the foundation for document processing, visual QA, and UI automation:
```python
import anthropic
import base64

client = anthropic.Anthropic()

def analyze_image(image_path: str, question: str) -> str:
    # Read the image and base64-encode it for the API.
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                }},
                {"type": "text", "text": question},
            ],
        }],
    )
    return message.content[0].text
```
Document processing pipelines extract structured information from PDFs, invoices, receipts, and forms. Rather than writing brittle regex parsers for each format, send the document image to a vision model and ask it to extract the fields you need in a structured format.
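As a sketch of that idea, the helpers below build the extraction request and parse the model's reply. The field list and prompt wording are illustrative assumptions, not a fixed schema; the parser tolerates the common case where the model wraps its JSON in a Markdown code fence.

```python
import base64
import json

# Hypothetical field list for an invoice; adjust for your own documents.
INVOICE_FIELDS = ["vendor", "invoice_number", "date", "total"]

def build_extraction_request(image_bytes: bytes, fields: list) -> dict:
    """Build a user message asking the model for a JSON object with the given keys."""
    prompt = (
        "Extract the following fields from this document and reply with "
        f"only a JSON object containing the keys {fields}. "
        "Use null for any field you cannot find."
    )
    return {
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.standard_b64encode(image_bytes).decode("utf-8"),
            }},
            {"type": "text", "text": prompt},
        ],
    }

def parse_extraction(response_text: str) -> dict:
    """Parse the model's reply, stripping a Markdown code fence if present."""
    text = response_text.strip()
    if text.startswith("```"):
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    return json.loads(text)
```

Passing `build_extraction_request(...)` into `client.messages.create(...)` as shown earlier, then running the reply through `parse_extraction`, gives you structured data without a format-specific parser.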
Browser automation with Playwright gives agents the ability to navigate websites, click buttons, fill forms, and read content. Combined with vision models, this creates agents that can interact with any web application:
```python
from playwright.async_api import async_playwright

async def screenshot_page(url: str) -> bytes:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Wait for network activity to settle so the page is fully rendered.
        await page.goto(url, wait_until="networkidle")
        screenshot = await page.screenshot(full_page=True)
        await browser.close()
        return screenshot
```
Claude Computer Use takes this further by giving agents direct control over a desktop environment. The agent sees the screen, decides where to click or type, and executes actions through a controlled interface. This enables automation of desktop applications that have no API.
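The agent side of that loop can be sketched as a dispatcher that routes each model-issued action to a handler. The action names and payload shapes below (`screenshot`, `left_click` with a `coordinate` pair, `type` with `text`) follow Anthropic's published computer-use tool schema at the time of writing; treat them as assumptions and check the current docs. The executor here is a stub that records actions instead of driving a real desktop.

```python
from typing import Any, Callable, Dict, List

class ActionRecorder:
    """Stub executor: logs actions rather than controlling a real screen."""
    def __init__(self) -> None:
        self.log: List[Dict[str, Any]] = []

    def screenshot(self, tool_input: dict) -> str:
        self.log.append({"action": "screenshot"})
        return "<base64 screenshot placeholder>"

    def left_click(self, tool_input: dict) -> str:
        x, y = tool_input["coordinate"]
        self.log.append({"action": "left_click", "x": x, "y": y})
        return f"clicked ({x}, {y})"

    def type_text(self, tool_input: dict) -> str:
        self.log.append({"action": "type", "text": tool_input["text"]})
        return "typed"

def dispatch(executor: ActionRecorder, tool_input: dict) -> str:
    """Route one model-issued tool call to the matching handler."""
    handlers: Dict[str, Callable[[dict], str]] = {
        "screenshot": executor.screenshot,
        "left_click": executor.left_click,
        "type": executor.type_text,
    }
    action = tool_input["action"]
    if action not in handlers:
        return f"unsupported action: {action}"
    return handlers[action](tool_input)
```

In a real agent, the return value of each handler goes back to the model as a tool result, and the loop repeats until the task is done. Keeping the dispatcher explicit like this makes it easy to whitelist actions and log everything the agent does.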
Hands-On Practice
Start with the simplest multi-modal task: screenshot analysis. Write a script that takes a URL, captures a screenshot with Playwright, and sends it to Claude's vision API with a question. Then extend it: navigate a multi-page site by having the agent decide which links to click based on what it sees, building toward a fully autonomous web research agent.
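For the navigation step, the one pure piece worth isolating is link selection: match the model's free-text answer against the links actually present on the page, so the agent can only follow links that exist. `choose_link` below is a hypothetical helper for that exercise, not part of any library.

```python
from typing import List, Optional

def choose_link(model_answer: str, links: List[str]) -> Optional[str]:
    """Return the first page link whose text or URL appears in the model's answer.

    Grounding the choice in the real link list prevents the agent from
    'navigating' to a link the model hallucinated.
    """
    answer = model_answer.lower()
    for link in links:
        if link.lower() in answer:
            return link
    return None
```

The surrounding loop is then: capture a screenshot with `screenshot_page`, ask the vision model which link to follow next, resolve the answer with `choose_link`, and navigate there; stop when the model reports the question is answered or no link matches.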