
Everything New in Claude Opus 4.6 (and 11 Things to Try Right Now)


Anthropic dropped Claude Opus 4.6 today, and this one is packed. It is the fastest turnaround between Opus releases yet, arriving just three months after Opus 4.5 shipped in November 2025. The model is smarter, can hold more context than ever, introduces multi-agent collaboration, and lands inside PowerPoint for the first time. Here is a breakdown of everything that changed, why it matters, and a list of things worth trying the moment you get access.

What Changed: The Big Features

1 Million Token Context Window

This is the headline technical upgrade. Opus 4.6 is the first model in the Opus line to support a 1 million token context window, available in beta through the API. To put that in perspective, 1 million tokens is roughly 750,000 words, or about 10 full-length novels loaded into a single conversation.

The real story here is not just the size but the quality of retrieval at scale. On the MRCR v2 benchmark (Multi-Round Co-reference Resolution, version 2), which tests how well models find specific information buried in massive amounts of text, Opus 4.6 scores 76%. Sonnet 4.5 manages 18.5% on the same test. That is not an incremental improvement. It is a qualitative shift in how much information the model can actually use, rather than merely accept as input.

For anyone who has hit the wall where Claude starts "forgetting" earlier parts of a long conversation, this is the fix. Anthropic also introduced a feature called context compaction, where the model can automatically summarize older portions of a conversation before the window fills up. This means extended sessions stay coherent instead of degrading over time.
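For a sense of what opting in might look like at the API level, here is a minimal sketch. The `anthropic-beta` header is how Anthropic has shipped earlier long-context betas, but the `context-1m-2026-02-01` flag name and the `build_long_context_request` helper are assumptions for illustration; check the API docs for the real beta string.

```python
# Sketch: assembling a Messages API payload that packs many documents
# into a single long-context request. The beta flag name below is a
# guess modeled on earlier Anthropic betas, not a confirmed value.

def build_long_context_request(system_prompt: str,
                               documents: list[str],
                               question: str) -> dict:
    """Join the documents into one corpus and build the request payload."""
    corpus = "\n\n---\n\n".join(documents)
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": f"{corpus}\n\nQuestion: {question}"}
        ],
    }

# Hypothetical beta header enabling the 1M-token window.
headers = {"anthropic-beta": "context-1m-2026-02-01"}

req = build_long_context_request(
    "Answer only from the provided documents.",
    ["doc one", "doc two"],
    "What is in doc two?",
)
```

The point of structuring it this way is that the documents travel in the prompt itself; with a 1M-token window, the corpus can be book-length before you need retrieval tricks.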

Agent Teams

This is the feature that will get the most attention from developers and power users. Claude Code now supports agent teams, where multiple AI agents split a large task into subtasks and work on them in parallel.

Think of it like assigning a project to a small team instead of a single person. One agent handles the frontend, another works on the Application Programming Interface (API), a third manages database migrations, and they coordinate with each other directly. Anthropic's Scott White compared it to having a talented team of humans working for you, noting that segmenting responsibilities allows agents to "coordinate in parallel and work faster."

Agent teams are available in research preview for API users and Claude Code subscribers. This is still early, but the architecture is interesting: you can define specialized roles, set up coordination logic, and observe intermediate work products as agents progress.

To demonstrate what agent teams can actually do at scale, Anthropic published a remarkable engineering blog post alongside the release: they tasked 16 parallel Claude agents with building a C compiler from scratch, in Rust, capable of compiling the Linux kernel. And then they mostly walked away.

Adaptive Thinking

Previous Claude models gave developers a binary choice: extended thinking on or off. Opus 4.6 introduces adaptive thinking, which lets the model dynamically adjust how deeply it reasons based on the complexity of what you are asking.

There are four selectable intensity levels. The default is "high," but Anthropic recommends dialing down to "medium" if you notice the model overthinking simple tasks. For developers, this translates directly into control over the tradeoff between intelligence, speed, and cost. A quick factual lookup does not need the same reasoning depth as debugging a complex authentication flow.
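As a rough sketch of what per-request control over this tradeoff could look like: the four level names and the `thinking` field shape below are assumptions based on the article's description (only "high" and "medium" are named in the source), so treat this as illustrative, not as the actual API schema.

```python
# Sketch: choosing a thinking intensity per request. Field name
# ("thinking"/"effort") and the level names are hypothetical.

EFFORT_LEVELS = ("low", "medium", "high", "max")  # assumed names

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a request payload with an explicit reasoning-depth setting.
    Defaults to "high", per the article; dial down for simple lookups."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 2048,
        "thinking": {"effort": effort},  # hypothetical parameter shape
        "messages": [{"role": "user", "content": prompt}],
    }
```

The design intuition is that effort becomes a per-call knob rather than a model-wide toggle: a factual lookup gets "medium", a gnarly debugging session gets "high" or above.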

Claude in PowerPoint

Claude now lives directly inside PowerPoint as a side panel, available in research preview for Max, Team, and Enterprise plan customers. Previously, you could ask Claude to create a presentation, but the output was a file you had to download and then manually edit in PowerPoint. Now the entire workflow happens inside the application.

The interesting technical detail is that Claude reads your existing slide masters, layouts, fonts, and color schemes. When it generates or edits slides, it stays on brand rather than producing generic-looking output. For anyone who has spent hours reformatting AI-generated slides to match corporate templates, this is a meaningful quality-of-life improvement.

Upgraded Claude in Excel

Claude in Excel got a significant upgrade alongside the Opus 4.6 release. It can now handle longer-running, more complex tasks and multi-step changes in a single pass. It can also ingest unstructured data and infer the correct structure without you explaining the format. Pair it with the new PowerPoint integration and you have a pipeline: process and structure data in Excel, then bring it to life visually in PowerPoint.

128K Token Output

Opus 4.6 can now generate up to 128,000 tokens in a single output. That is roughly 96,000 words in one response. For context, the previous generation topped out at significantly less. This is a game-changer for tasks like generating full reports, long-form documentation, or entire codebases in a single pass. Bolt.new reported that Opus 4.6 "one-shotted a fully functional physics engine," handling a large multi-scope task without needing to be broken into pieces.

The Numbers: Benchmarks Worth Knowing

For those who track model performance, here are the highlights:

Terminal-Bench 2.0 (agentic coding): 65.4%, the highest score ever recorded. Up from 59.8% for Opus 4.5.

ARC AGI 2 (novel problem-solving): 68.8%. This nearly doubled from Opus 4.5's score of 37.6%, and significantly outpaces GPT-5.2 at 54.2% and Gemini 3 Pro at 45.1%.

GDPval-AA (real-world knowledge work): 1,606 Elo. That is 144 points ahead of GPT-5.2 and 190 points ahead of its own predecessor.

BrowseComp (finding hard-to-find information online): 84%, the best score of any model tested.

Humanity's Last Exam (multidisciplinary reasoning): 53.1% with tools, leading all frontier models.

OSWorld (agentic computer use): 72.7%, up from 66.3% for Opus 4.5.

BigLaw Bench (legal reasoning): 90.2%, with 40% perfect scores.

MRCR v2 (long-context retrieval): 76%, compared to 18.5% for Sonnet 4.5.

The ARC AGI 2 score is the most interesting one to watch. That benchmark specifically measures the ability to solve problems that are easy for humans but hard for AI systems. Nearly doubling the score in a single generation suggests something meaningful is shifting in how the model approaches unfamiliar problems.

The Cybersecurity Story

I wrote a separate deep-dive on this, but the short version: before launch, Anthropic's frontier red team gave Opus 4.6 access to standard security tools in a sandbox and pointed it at heavily fuzzed open-source codebases. With no specialized instructions, it found over 500 previously unknown zero-day vulnerabilities, all validated by humans. It outperformed Opus 4.5 in 38 of 40 cybersecurity investigation tests. Anthropic introduced six new cybersecurity-specific probes for detecting misuse alongside the release.

If you work in security, read the full article. This is the part of the release that will have the most lasting impact.

Building a C Compiler with 16 Parallel Claudes

This deserves its own section because it is one of the most impressive demonstrations of autonomous AI development I have seen.

Nicholas Carlini, a researcher on Anthropic's Safeguards team, published a detailed engineering blog post about using Opus 4.6's agent teams to build a complete C compiler. The project, called "Claude's C Compiler," is open source on GitHub and the results are genuinely startling.

Here are the raw numbers: 16 agents ran in parallel across nearly 2,000 Claude Code sessions over two weeks. The project consumed 2 billion input tokens and generated 140 million output tokens, at a total cost just under $20,000. The output is a 100,000-line Rust-based C compiler that can build Linux 6.9 on x86, ARM, and RISC-V architectures. It also compiles QEMU, FFmpeg, SQLite, PostgreSQL, and Redis. It has a 99% pass rate on most compiler test suites including the GCC torture test suite. And yes, it can compile and run Doom.

The implementation is a clean-room build with no internet access during development, depending only on the Rust standard library.

How the Agent Harness Works

The scaffolding is surprisingly simple. Each agent runs in a Docker container with a shared upstream git repository. Claude operates in a loop: finish a task, pick up the next one. To prevent agents from stepping on each other, Carlini implemented a lightweight lock system using text files. An agent claims a task by writing to a current_tasks/ directory. Git's synchronization mechanism prevents two agents from grabbing the same task. When an agent finishes, it pulls from upstream, merges changes from other agents, pushes its own changes, and releases the lock.
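The claim-a-task pattern can be sketched locally. In Carlini's harness the arbiter is git itself (a rejected push means another agent claimed the task first); in this standalone sketch, an atomic exclusive file create plays that role. The function names and lock-file layout are illustrative, though the `current_tasks/` directory name comes from the blog post.

```python
import os

TASKS_DIR = "current_tasks"  # directory name from the blog post

def try_claim(task_id: str, agent_id: str) -> bool:
    """Claim a task by creating its lock file atomically.
    O_CREAT | O_EXCL guarantees exactly one claimant succeeds,
    mirroring how a rejected git push arbitrates in the real harness."""
    os.makedirs(TASKS_DIR, exist_ok=True)
    path = os.path.join(TASKS_DIR, f"{task_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent got there first
    with os.fdopen(fd, "w") as f:
        f.write(agent_id)  # record who holds the lock
    return True

def release(task_id: str) -> None:
    """Release a finished task so another agent can pick it up."""
    os.remove(os.path.join(TASKS_DIR, f"{task_id}.lock"))
```

What makes this scheme appealing is that it needs no coordination service: the filesystem (or, in the real harness, the shared git remote) is the only source of truth.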

There is no orchestration agent coordinating the work. Each Claude instance independently decides what to work on next, typically picking the "next most obvious" problem. When stuck, agents maintain running docs of failed approaches and remaining tasks.

Specialization Across Agents

Not all 16 agents were doing the same thing. Carlini assigned different roles: some agents focused on implementing new compiler features, one was dedicated to coalescing duplicate code, another worked on compiler performance, a third focused on output code efficiency, one critiqued the overall Rust code quality and made structural improvements, and another maintained documentation.

Where It Broke Down

Carlini is candid about the limitations. The compiler still lacks a 16-bit x86 code generator needed to boot Linux out of real mode (it calls out to GCC for that phase on x86, though ARM and RISC-V compile entirely on their own). It does not have its own assembler and linker yet. The generated code is less efficient than GCC even with all optimizations disabled. And the Rust code quality, while reasonable, is not what an expert Rust developer would produce.

The most interesting failure mode was parallelism collapse. When agents started compiling the Linux kernel, they all hit the same bug simultaneously, all fixed it independently, and then overwrote each other's changes. Having 16 agents running provided no benefit because they were all stuck on the same problem. The fix was clever: use GCC as an oracle to randomly compile most kernel files, only using Claude's compiler for the remainder, allowing each agent to isolate and fix different bugs in different files.
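The oracle trick amounts to a per-file coin flip: route most files to the known-good compiler so each agent's failures come from a small, mostly disjoint set. This sketch is my own reconstruction; the function name and the split ratio are illustrative, not taken from the harness.

```python
import random

def assign_compilers(files: list[str],
                     claude_fraction: float = 0.1,
                     seed: int = 42) -> dict[str, str]:
    """Route most source files to gcc (the trusted oracle) and a random
    remainder to the in-progress compiler. Different seeds give different
    agents different failing files, restoring the benefit of parallelism."""
    rng = random.Random(seed)
    return {
        f: ("claude-cc" if rng.random() < claude_fraction else "gcc")
        for f in files
    }
```

A usage note: giving each agent its own seed is the whole point; with identical file assignments, the agents would collapse back onto the same bug.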

Why This Matters

Carlini frames this as a capability benchmark, designed to stress-test the limits of what large language models (LLMs) can barely achieve today in order to prepare for what they will reliably achieve in the future. Previous Opus 4 models could barely produce a functional compiler. Opus 4.5 was the first to cross the threshold of passing large test suites but could not compile real projects. Opus 4.6 jumped to compiling the Linux kernel.

The $20,000 price tag sounds high, but Carlini notes it is a fraction of what it would cost him to build this himself, let alone with an entire team. That cost-to-capability ratio is the real story.

His closing thought is worth quoting: "I did not expect this to be anywhere near possible so early in 2026."

The full engineering blog post is at anthropic.com/engineering/building-c-compiler and the compiler source code is on GitHub. Both are worth reading in full.

Pricing and Availability

Pricing is unchanged from Opus 4.5: $5 per million input tokens and $25 per million output tokens. The model is available today on claude.ai, through the API, and on all major cloud platforms including Amazon Web Services (AWS) Bedrock, Google Cloud Vertex AI, Microsoft Foundry, and GitHub Copilot.

For developers, the model string is claude-opus-4-6.

There is also a new option for US-only data processing, which adds a 10% premium.

11 Fun Things to Try with Opus 4.6

Now for the practical part. Here are eleven things worth testing right now to get a feel for what the new model can do.

1. Feed It an Entire Codebase

With the 1 million token context window, try loading an entire project repository into a single session. Ask Opus 4.6 to review the architecture, identify potential issues, or suggest refactoring opportunities. The long-context retrieval improvements mean it should actually remember and reference code from the beginning of the conversation, not just the most recent files.
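If you want to script the loading step, a small packer that concatenates a repository into one prompt works well. The helper below is my own, not an Anthropic tool, and the roughly 4-characters-per-token estimate is a common heuristic, not an exact count.

```python
import os

def pack_repo(root: str,
              exts: tuple[str, ...] = (".py", ".rs", ".ts")) -> tuple[str, int]:
    """Concatenate matching source files under `root` into one prompt
    string, with a path header per file, and return a rough token
    estimate (~4 characters per token)."""
    parts = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    parts.append(f"=== {path} ===\n{f.read()}")
    blob = "\n\n".join(parts)
    return blob, len(blob) // 4
```

The path headers matter: they let the model cite which file a finding came from, which is exactly the long-context retrieval behavior worth testing.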

2. Build a Full Application in One Shot

The 128K token output and improved planning capabilities make single-pass application generation practical. Try asking for a complete web application with frontend, backend, and database schema. Bolt.new got a fully functional physics engine in one shot, so push the boundaries.

3. Run a Multi-Agent Coding Project

If you have Claude Code access, set up an agent team. Define one agent for frontend work, one for backend, and one for testing. Give them a project description and watch how they coordinate. This is still in research preview, so expect some rough edges, but the architecture is genuinely novel for a first-party model feature.

4. Do a "Vibe Working" Session in Cowork

Anthropic is using the phrase "vibe working" as the knowledge-worker equivalent of "vibe coding." Open Cowork, point Claude at a folder of messy documents or data files, and give it a high-level goal like "analyze these quarterly reports and produce an executive summary with supporting charts." Let it work autonomously and see what it produces with minimal guidance.

5. Stress-Test Long Conversations

Start a complex, multi-turn conversation and keep going far longer than you normally would. Ask it to reference specific details from much earlier in the conversation. With adaptive thinking and context compaction, the model should maintain coherence across sessions that would have degraded previous models. See where it breaks.

6. Financial Modeling from Scratch

Opus 4.6 leads the Finance Agent benchmark and showed a 23 percentage-point improvement over Sonnet 4.5 on real-world finance tasks. Try handing it a set of financial data and asking for a full analysis with a Discounted Cash Flow (DCF) model, comparable company analysis, and a formatted presentation of findings. See how close to "analyst-grade" the output actually gets.
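As a refresher on what you would be checking in the output, a bare-bones DCF is just projected cash flows discounted and summed, plus a terminal value. This is a generic textbook formulation with illustrative inputs, not anything produced by the model.

```python
def dcf_value(cash_flows: list[float],
              rate: float,
              terminal_growth: float) -> float:
    """Present value of explicit-period cash flows plus a Gordon-growth
    terminal value, all discounted back to today at `rate`."""
    # Discount each year's cash flow: CF_t / (1 + r)^t
    pv = sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))
    # Terminal value at the end of the explicit period, then discount it back.
    terminal = cash_flows[-1] * (1 + terminal_growth) / (rate - terminal_growth)
    pv += terminal / (1 + rate) ** len(cash_flows)
    return pv
```

Useful sanity check when grading the model's spreadsheet: a single 100 cash flow at a 10% discount rate with zero terminal growth should value to exactly 1,000 (100/1.1 plus 1,000/1.1).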

7. PowerPoint from a Brain Dump

If you are on a Max, Team, or Enterprise plan, try the Claude in PowerPoint integration. Give it a rough outline or even just a stream of consciousness about a topic and let it build a full branded deck. The fact that it reads your existing templates and slide masters means the output should not look like generic AI slides.

8. Upload a Massive Document and Play 20 Questions

Find the longest, most detailed document you have access to (a regulatory filing, a technical specification, a legal contract) and upload it. Then ask increasingly specific and obscure questions about the content. The 76% MRCR v2 score suggests it should be able to find needles in very large haystacks. Try to stump it.

9. Ask It to Debug Something That Stumped You

The improved code review and self-correction capabilities are one of the most practically useful upgrades. Take a bug you have been stuck on, provide the relevant code and context, and see if Opus 4.6 can identify the issue. Cursor's CEO specifically called out "stronger tenacity" and "better code review" as standout improvements.

10. Creative Problem-Solving Challenges

The ARC AGI 2 score (68.8%, nearly double the previous model) suggests a real improvement in tackling novel, unfamiliar problems. Try giving it abstract reasoning puzzles, unusual design challenges, or problems that require lateral thinking rather than pattern matching against training data. This is where the gap between Opus 4.6 and competing models appears widest.

11. Clone Claude's C Compiler and Poke Around

Anthropic open-sourced the 100,000-line C compiler that 16 parallel Claude agents built autonomously. Clone it from GitHub, read through the code, and try compiling your favorite C projects with it. Better yet, read the git history to watch the agents claim tasks, merge each other's changes, and debug problems in real time. It is a fascinating artifact that shows both the strengths and current ceilings of autonomous AI development. If you find a project it cannot compile, that is useful data too.

The Bigger Picture

Three months between Opus releases is a fast cadence. The model went from being primarily a developer tool to something Anthropic is clearly positioning for broad knowledge work across finance, legal, consulting, and general office productivity. The Wall Street reaction to Cowork and these capabilities (software stocks down $285 billion this week) tells you the market is taking the threat to specialized enterprise software seriously.

For individual users, the practical takeaway is straightforward: Opus 4.6 is meaningfully better at staying focused on long, complex tasks, it can hold and actually use far more context than before, and the new agent teams feature opens up workflows that were not previously possible with a single model. Whether you are writing code, building presentations, analyzing data, or doing research, there is something here worth exploring.

The model is live now. Go try it.
