Opus 4.7 Leads Every Benchmark Today. Here Are 5 Problems Where That Actually Matters.

Anthropic shipped Claude Opus 4.7 today, and the benchmark sheet is unusually complete. SWE-Bench Pro jumped from 53.4% to 64.3%, an 11-point gain. SWE-Bench Verified reached 87.6%. OSWorld hit 78%. CharXiv visual reasoning leapt from 69.1% to 82.1%. GPQA Diamond sits at 94.2%. It leads GPT-5.4 and Gemini 3.1 Pro on agentic reasoning, which makes it the most powerful generally available LLM in VentureBeat's framing, though Anthropic notes it trails its own unreleased Mythos Preview model.

Pricing is unchanged from Opus 4.6: $5 per million input tokens, $25 per million output tokens.

Two months ago I wrote about Sonnet 4.6 and made the case that for the bulk of daily work — single-file edits, document Q&A, quick drafts — Sonnet at $3/$15 is the default. That hasn't changed. Sonnet is still the workhorse.

But there is a band of problems where Opus 4.7 is not a luxury — it's the only model I trust to finish the job. Long-horizon agent work without drift, security analysis where a subtle miss is expensive, multi-file refactors where discipline matters more than speed, and visual reasoning over real documents. Benchmarks are the scaffolding. These are the five places the capability actually shows up.

What's Actually New in 4.7

Before the use cases, the shortlist of changes worth knowing:

  • xhigh effort level. New reasoning intensity slot between "high" and "max." Gives finer control over the reasoning-latency tradeoff — you're no longer choosing between "thinks hard" and "thinks forever."
  • Vision, 3x resolution. Accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), versus ~1.15 megapixels in 4.6. CharXiv jumped 13 points and visual navigation without tools went from 57.7% to 79.5%.
  • /ultrareview in Claude Code. Dedicated review sessions tuned for bugs and design issues, not drive-by feedback.
  • Task budgets (API beta). You can cap token spend on long agent runs — meaningful if you've ever watched an agent burn $40 spiraling on one bug.
  • Better file-based memory. More reliable multi-session scratchpads and notes.
  • Instruction literality. Anthropic flags this explicitly: 4.7 takes instructions precisely rather than loosely. If your existing prompts assumed the model would fill in gaps, expect to retune.
  • Automatic cybersecurity safeguards. First Claude with automated detection and blocking of prohibited cybersecurity uses, paired with a new Cyber Verification Program for legitimate security professionals. More on this in §2.
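
To make the vision limit concrete, here is a minimal sketch of the arithmetic for fitting an image under the 2,576-pixel long edge before sending it. The function names are mine, and whether the API resizes oversized images for you is not something the announcement spells out, so treat this as a client-side precaution rather than a required step.

```python
def fit_long_edge(width: int, height: int, max_long_edge: int = 2576) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_long_edge."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # already within the limit; never upscale
    scale = max_long_edge / long_edge
    # Round to the nearest pixel, but keep each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))


def megapixels(width: int, height: int) -> float:
    """Pixel count expressed in millions of pixels."""
    return width * height / 1_000_000
```

A 5152 x 2912 scan comes out at 2576 x 1456, which is where the ~3.75 megapixel figure lands for a 16:9-ish page.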

If you're already on 4.6, this is an upgrade, not a migration. If you skipped 4.6 and are on 4.5, the cumulative gap is large enough to justify the swap on day one.

1. Agentic Coding in Claude Code

This is where Opus 4.7 most obviously earns its keep. The 11-point SWE-Bench Pro jump is not a rounding-error improvement — it's the difference between "finishes the easy cases" and "finishes the cases that used to require a human rescue." Rakuten reported 3x more production tasks resolved on Rakuten-SWE-Bench. Anthropic quotes a fintech VP of Technology: "Opus 4.7 catches logical faults during planning and accelerates execution far beyond previous models."

The specific behavior I care about: Opus does not drift. Sonnet, on a long-running task, occasionally rewrites a function it should not have touched or introduces a helper no one asked for. Opus 4.7 holds the constraints — "do not touch the auth layer," "keep the existing error contract," "do not add dependencies" — across dozens of turns. With the new instruction literality, that discipline tightens further. If you say "only modify files matching this glob," it actually only modifies those files.
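
I still verify the glob constraint after the run instead of taking it on faith. A minimal sketch, assuming you already have the list of changed paths (for example from your version control diff); `out_of_scope` is a hypothetical helper, not part of Claude Code.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_globs):
    """Return the changed paths that match none of the allowed glob patterns.

    Note: in fnmatch patterns, '*' also matches '/', so 'src/billing/*.py'
    covers nested directories under src/billing as well.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_globs)
    ]


# Illustrative paths only.
violations = out_of_scope(
    ["src/billing/invoice.py", "src/auth/session.py"],
    ["src/billing/*.py"],
)
```

An empty result means the agent stayed inside the fence; anything else is a file to revert or review by hand.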

Two features to try together: the /ultrareview command for focused bug and design review sessions, and the xhigh effort level when you want the model to think harder on a plan than "high" allows but don't want the latency of max.

The pattern that makes Opus worth paying for here: anything where the reasoning itself is the product, not just the means to an end. Code generation can use Sonnet. Architectural planning across a repo should use Opus.

2. Security Work: Threat Intel, Vuln Triage, Red Team

Important change first: Opus 4.7 is the first Claude with automated detection and blocking of prohibited cybersecurity requests. Anthropic also launched a Cyber Verification Program for legitimate security professionals. If you do authorized red-team work, pen testing, or vulnerability research, pay attention to whether your prompts are tripping the new safeguards, and get verified if your workflow depends on the edge cases.

That caveat aside: Anthropic explicitly lists autonomous penetration testing as a highlighted use case in the 4.7 announcement, and the model card reflects it. The cyber capabilities were intentionally dialed back from Mythos Preview, but they are still improvements over 4.6 — which itself found 500+ zero-days in fuzzed open-source codebases under test conditions.

Three security tasks where I reach for Opus 4.7 rather than Sonnet:

  • Reading and summarizing threat actor reports. The model needs to separate the attack chain from marketing fluff and speculation. Opus 4.7 does this cleanly and flags what's unverified rather than restating claims as fact.
  • Reviewing infrastructure-as-code for misconfigurations. The 1M context means you can load entire Terraform or Kubernetes manifests and ask "what's exposed to the internet that shouldn't be, and what's the blast radius if it's hit?" The vision improvement helps when your IaC lives alongside architecture diagrams.
  • Reasoning about prompt-injection and agent-misuse scenarios. Meta but real: using Opus to think through adversarial inputs to other Claude-powered systems you're building. The instruction literality helps here — a model that follows "do not take actions the user did not request" precisely is easier to reason about than one that interprets it loosely.

For authorized security work, the automatic blocking is the biggest UX change in the release. Plan for it.

3. Writing That Has to Hold a Voice

Fair warning: this is the use case most writers will push back on. Sonnet 4.6 drafts competent prose. Opus 4.7 is more expensive. The math looks bad.

Here's why I still use it for drafts that matter. Opus 4.7's instruction literality changes what's possible in prose review. "Read this draft, tell me where the argument is weak, do not suggest rewrites, just find the holes" actually produces criticism without patching — Sonnet tends to offer fixes even when told not to. That constraint-following matters when you want a harsh reader, not a co-author.

The other thing: Opus is better at holding a voice across a long piece. When you're writing in your own voice — specific opinions, specific phrasings, a rhythm you care about — Sonnet occasionally smooths edges into AI-adjacent prose. Opus keeps the edges, especially at xhigh effort.

I'm not using Opus to write first drafts. I'm using it as a harsher editor than I can be on my own work. That's a narrow, defensible reason to pay the premium for a use case most people will say isn't worth it.

4. Long-Document Analysis (Including the Pictures)

Opus 4.6 set the bar on long-context retrieval with MRCR v2 at 76% versus Sonnet 4.5's 18.5%. Opus 4.7 keeps the 1M context window and adds the vision upgrade — which matters more than it sounds, because real-world long documents aren't pure text. Contracts have signature pages scanned in. Compliance frameworks have diagrams. Financial reports are 30% charts. The 3x image resolution means you can actually parse them now.

The pattern I use to decide between Sonnet and Opus: if the task requires synthesizing across the document, pay for Opus. If it requires finding a fact within the document, Sonnet is fine.

Three workflows where I reach for Opus 4.7 specifically:

  • Loading an entire codebase into context and asking architectural questions. Not "what does this file do" — that's Sonnet-tier. But "trace this data flow end-to-end across 40 files and tell me where the implicit assumptions break down" is Opus work.
  • Compliance document review. Load a regulatory framework (SOC 2, HIPAA, GDPR) alongside an internal policy set and ask it to find gaps. 1M context plus higher-fidelity retrieval means you don't have to pre-summarize — which is where nuance usually gets lost.
  • Financial analysis with real documents. Opus 4.7 hit 64.4% on the Finance Agent benchmark (up from 60.1%), and the vision improvements mean it can actually read the embedded charts in a 10-K instead of working off just the text. Anthropic calls out legal document analysis and patent workflows in the same breath.

5. Agent Teams and Parallel Workflows

Anthropic's C compiler project showed what 16 parallel Claudes could do with Opus 4.6: a 100,000-line compiler that boots Linux, built in two weeks for about $20K. Opus 4.7 adds two things that matter for serious agent orchestration: 14% better multi-step agentic reasoning with roughly a third of the tool errors, and API task budgets you can use to cap runaway spend.

Agent teams is the feature most people don't use because it looks like overkill for day-to-day work. For day-to-day work, it is overkill. One Opus or Sonnet session is almost always enough.

Two workflows where it's a real unlock:

  1. Parallelizable security audits. One agent per component — auth, database, API surface, frontend, deployment. Each reports findings, a coordinator synthesizes. The parallelism is real because the components are genuinely independent.
  2. Large refactors across independent modules. Same pattern — parcel the work, let each agent operate on its slice, reconcile at the end. Use task budgets to cap per-agent spend so one stuck subagent doesn't blow the budget for the whole run.
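
The per-agent cap can be sketched as a plain client-side ledger. This is not the API's task-budget feature, whose parameters I haven't shown here. `TokenBudget`, the agent names, and the 200,000-token limit are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Client-side token ledger for one subagent (hypothetical helper)."""

    limit: int
    spent: int = 0

    def record(self, tokens: int) -> bool:
        """Add the tokens used by one turn; return True if the agent may continue."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.spent += tokens
        return self.spent < self.limit

    @property
    def remaining(self) -> int:
        """Tokens left before the cap, never negative."""
        return max(0, self.limit - self.spent)


# One ledger per subagent, so a stuck agent only exhausts its own slice.
budgets = {name: TokenBudget(limit=200_000) for name in ("auth", "database", "api")}
```

The coordinator calls `record` with each turn's reported usage and stops scheduling an agent once it returns False.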

The main failure mode, which Anthropic flagged in the C compiler write-up, is "parallelism collapse": agents all hit the same bug at once, all fix it independently, and overwrite each other's work. Stratify the tasks so agents can't collide on the same hot path. The lower tool-error rate in 4.7 helps, but it doesn't make the orchestration problem go away.
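
One cheap way to stratify is to check, before launching anything, that no two agents own the same file. A minimal sketch, assuming each agent's slice is expressed as a set of paths; the agent names and paths below are made up.

```python
from itertools import combinations


def find_collisions(assignments: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Map each pair of agents to the files both are allowed to edit."""
    collisions = {}
    # Sort agent names so each pair is reported in a stable order.
    for a, b in combinations(sorted(assignments), 2):
        shared = assignments[a] & assignments[b]
        if shared:
            collisions[(a, b)] = shared
    return collisions


# Illustrative plan: two agents both claim the shared database module.
plan = {
    "auth": {"src/auth/login.py", "src/common/db.py"},
    "api": {"src/api/routes.py", "src/common/db.py"},
    "frontend": {"web/app.ts"},
}
overlaps = find_collisions(plan)
```

Any non-empty result is a hot path to hand to a single owner, or to the coordinator, before the run starts.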

The Heuristic: When to Use Which

Opus 4.7 costs about 1.7x Sonnet 4.6 per token. That's a smaller gap than the last Claude generation, which actually makes model selection harder — the cost difference is no longer dramatic enough to decide for you. You have to think about the task.
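
The 1.7x figure falls straight out of the list prices ($5/$25 for Opus 4.7, $3/$15 for Sonnet 4.6, per million tokens). A quick sketch of the arithmetic; the model labels are shorthand for this post, not API model IDs, and the token counts are an invented example run.

```python
# (input, output) price in USD per million tokens, as quoted in this post.
PRICES = {
    "opus-4.7": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run at list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical run: 200k input tokens, 20k output tokens.
opus = run_cost("opus-4.7", 200_000, 20_000)
sonnet = run_cost("sonnet-4.6", 200_000, 20_000)
ratio = opus / sonnet
```

Because input and output prices both differ by the same 5:3 factor, the ratio stays at about 1.67 whatever the input/output mix.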

Sonnet 4.6 for most work:

  • Any single-file code change
  • Document summaries and narrow Q&A
  • Short-form drafting (emails, Slack messages, code comments)
  • Anything you'd normally re-run if the first pass was off

Opus 4.7 for the harder problems:

  • Multi-file refactors where drift across turns matters
  • Security analysis where being wrong is expensive
  • Long-horizon planning tasks — anything running longer than an hour of agent time
  • Visual reasoning over real documents (charts, diagrams, scanned contracts)
  • Agent team orchestration with task budgets
  • Work where the reasoning is the product, not just scaffolding around it
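
If you want the heuristic in a form you can drop into a dispatcher, here is one rough encoding. The `Task` fields, the thresholds, and the model labels are my own simplification of the two lists above, not anything Anthropic publishes.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Rough description of a job; every field is an illustrative assumption."""

    files_touched: int = 1
    expected_agent_minutes: int = 5
    error_is_expensive: bool = False
    needs_visual_reasoning: bool = False
    reasoning_is_the_product: bool = False


def pick_model(task: Task) -> str:
    """Return a model label: Opus for the harder problems, Sonnet otherwise."""
    needs_opus = (
        task.files_touched > 1               # multi-file refactors
        or task.expected_agent_minutes > 60  # long-horizon agent runs
        or task.error_is_expensive           # e.g. security analysis
        or task.needs_visual_reasoning       # charts, diagrams, scans
        or task.reasoning_is_the_product
    )
    return "opus-4.7" if needs_opus else "sonnet-4.6"
```

The defaults describe a cheap, re-runnable single-file job, so `pick_model(Task())` falls through to Sonnet.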

The capability gap is wider than the price gap. That's the real shift with this release — for the problems where Opus matters, the delta over Sonnet has grown, while the per-token premium has not. Use it where it matters. Let Sonnet handle the rest.
