Phase 5: Advanced · Step 13 of 14

Multi-Modal & Computer Use

Build agents that see, read documents, and interact with browsers and desktops.

Vision models · Document understanding · Browser automation · Computer use · Audio/video processing

Getting Started

Multi-modal agents process more than text. They see images, read documents, navigate websites, and interact with desktop applications. This capability transforms what agents can automate — instead of being limited to APIs and structured data, they can work with the same interfaces humans use.

The two key technologies are vision-language models (Claude, GPT-4V) that understand images alongside text, and browser/desktop automation tools (Playwright, Claude Computer Use) that give agents the ability to interact with visual interfaces.

Key Concepts

Vision models accept images as input alongside text prompts. You can send a screenshot, a photo, a chart, or a scanned document and ask questions about it. This is the foundation for document processing, visual QA, and UI automation:

import anthropic
import base64

client = anthropic.Anthropic()

def analyze_image(image_path: str, question: str) -> str:
    # Read the image and base64-encode it for the API payload.
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                # The image block comes first; media_type must match the file
                # (PNG here), so adjust it for JPEG or other formats.
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                }},
                {"type": "text", "text": question},
            ],
        }],
    )
    return message.content[0].text

Document processing pipelines extract structured information from PDFs, invoices, receipts, and forms. Rather than writing brittle regex parsers for each format, send the document image to a vision model and ask it to extract the fields you need in a structured format.
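As a minimal sketch of that approach: the prompt below asks the model to return only JSON for a handful of invoice fields, and a small helper parses the reply, tolerating the markdown fences models sometimes add. The prompt wording, the field names, and the `parse_json_response` helper are illustrative choices, not a fixed API; you would pass the prompt as the question in a vision call like `analyze_image` above.

```python
import json

# Illustrative extraction prompt: constrain the model to a fixed JSON shape.
EXTRACTION_PROMPT = """Extract the following fields from this invoice image and
return them as a JSON object with exactly these keys:
invoice_number, date, vendor, total.
Use null for any field you cannot find. Return only the JSON, no prose."""

def parse_json_response(text: str) -> dict:
    """Parse the model's reply into a dict, stripping optional ```json fences."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (with any language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)
```

Constraining the output to known keys (with null for missing values) makes the pipeline easy to validate: if `json.loads` fails or a key is absent, you can retry or flag the document for review.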

Browser automation with Playwright gives agents the ability to navigate websites, click buttons, fill forms, and read content. Combined with vision models, this creates agents that can interact with any web application:

from playwright.async_api import async_playwright

async def screenshot_page(url: str) -> bytes:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Wait for network activity to settle so dynamic content has rendered.
        await page.goto(url, wait_until="networkidle")
        # full_page captures the entire scrollable page, not just the viewport.
        screenshot = await page.screenshot(full_page=True)
        await browser.close()
        return screenshot

Claude Computer Use takes this further by giving agents direct control over a desktop environment. The agent sees the screen, decides where to click or type, and executes actions through a controlled interface. This enables automation of desktop applications that have no API.
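The core of such a loop is translating the model's proposed action into a concrete input event. A sketch of that translation step, under assumptions: the dict shape (`action`/`coordinate`/`text` keys) mirrors the kind of tool input a computer-use model emits, but the exact schema and the executor that would consume these pairs are placeholders, not a real API.

```python
def normalize_action(tool_input: dict) -> tuple:
    """Map a model-proposed action dict to a (name, args) pair for dispatch.

    The key names here are an assumed schema for illustration. A real loop
    would take a screenshot, send it to the model, normalize the returned
    action with this function, execute it, and repeat.
    """
    action = tool_input["action"]
    if action == "click":
        x, y = tool_input["coordinate"]
        return ("click", (x, y))
    if action == "type":
        return ("type", (tool_input["text"],))
    if action == "screenshot":
        return ("screenshot", ())
    raise ValueError(f"Unsupported action: {action}")
```

Keeping this normalization step separate from execution makes it easy to log, rate-limit, or require human confirmation for risky actions before anything touches the real desktop.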

Hands-On Practice

Start with the simplest multi-modal task: screenshot analysis. Write a script that takes a URL, captures a screenshot with Playwright, and sends it to Claude's vision API with a question. Then extend it: navigate a multi-page site by having the agent decide which links to click based on what it sees, building toward a fully autonomous web research agent.
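Wiring the two earlier pieces together mostly means building the vision message from raw screenshot bytes instead of a file on disk. A sketch, where `build_vision_message` is a helper name introduced here (the content shape matches `analyze_image` above):

```python
import base64

def build_vision_message(png_bytes: bytes, question: str) -> list:
    """Build the messages payload pairing a PNG screenshot with a question."""
    data = base64.standard_b64encode(png_bytes).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": data,
            }},
            {"type": "text", "text": question},
        ],
    }]

# Wiring it together (sketch):
# screenshot = await screenshot_page(url)   # from the Playwright example above
# message = client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=1024,
#     messages=build_vision_message(screenshot, question),
# )
```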

Exercises

Build a Web Scraping Agent with Playwright

Create an agent that takes a URL and a natural language question, navigates to the page using Playwright, takes a screenshot, sends the screenshot to a vision model for analysis, and returns a structured answer. Handle dynamic pages that require scrolling or clicking.

Knowledge Check

What is the key advantage of using vision models for web scraping compared to traditional HTML parsing?

Milestone Project

Agent that can navigate websites, fill forms, and extract structured data