The interesting part about Project Glasswing is not the branding. It's the implication: frontier AI is moving from "good at cyber tasks" to "useful across long attack chains." That's a very different threshold.
A 32-step cyber range measures whether an AI agent can sustain progress through a long, sequential attack chain, not just win a one-shot exploit task. That makes it a much better proxy for real operational autonomy than CTF-style benchmarks, because it tests planning, memory, pivoting, and state tracking over hours of work [1].
Here's the first thing I noticed when I dug into the sources: the phrase "solved" is doing a lot of work. The strongest Tier 1 source I found, the AI Security Institute paper on multi-step cyber attack scenarios, does not say a frontier model fully completed "The Last Ones." It says the best single run reached 22 of 32 steps, with average performance rising sharply across newer models and larger token budgets [1].
That still matters a lot. In fact, I'd argue it matters more than a full-completion headline would. Why? Because it shows the bottleneck has shifted. The old question was whether models could do any serious cyber work at all. The new question is how quickly they're closing the gap on long-horizon attack execution.
"The Last Ones" is intentionally built to stress autonomy. It requires reconnaissance, credential theft, lateral movement, reverse engineering, and eventual exfiltration inside a corporate network simulation [1]. The paper also makes a crucial point: later steps demand specialist knowledge and long action sequences, so step count is not linear progress. Step 22 is not "69% done" in any simple sense [1].
A multi-step cyber range matters more than a CTF score because it exposes the part current agents usually fail at: staying coherent across many dependent actions. In cyber offense, isolated skill is useful, but campaign continuity is what separates a clever model from an actually dangerous one [1][2].
This is where the second Tier 1 source helps. The Excalibur paper argues that many pentesting agents fail for two reasons: missing tools and prompts, or deeper planning and state-management limits [2]. The authors call those Type A and Type B failures. Type B is the big one here. It's the reason an agent that can solve individual tasks still falls apart halfway through a realistic intrusion.
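As a rough sketch of that taxonomy (my own field names and heuristics, not code from the Excalibur paper), failure triage might look like this:

```python
# Hypothetical triage of agent failures into the Excalibur paper's two
# buckets. The field names and heuristics are invented for illustration;
# the paper describes the taxonomy, not this code.

from dataclasses import dataclass

@dataclass
class FailedRun:
    missing_tool: bool    # agent needed a capability it didn't have
    prompt_misread: bool  # task was misunderstood from the start
    lost_state: bool      # forgot earlier findings (credentials, hosts)
    plan_stalled: bool    # looped or abandoned a viable attack path

def classify_failure(run: FailedRun) -> str:
    """Type A failures are fixable with better tools and prompts.
    Type B failures point at planning and state-management limits,
    the harder and more interesting bottleneck."""
    if run.missing_tool or run.prompt_misread:
        return "Type A: scaffolding gap"
    if run.lost_state or run.plan_stalled:
        return "Type B: planning/state limit"
    return "unclassified"

print(classify_failure(FailedRun(False, False, True, False)))
# -> Type B: planning/state limit
```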
That framing lines up almost perfectly with the AI Security Institute results. The public evidence says frontier models are improving, but they still hit hard bottlenecks around NTLM relay coordination, CI/CD compromise chains, reverse engineering, and context load [1]. In other words, the cyber challenge is becoming less about isolated exploit cleverness and more about long-range execution discipline.
That's why "Mythos solved it" should be translated into plainer English: Mythos likely demonstrates that frontier cyber systems are getting much better at maintaining useful attack momentum over extended sequences. That is more important than a flashy one-off exploit.
Here's a quick comparison:
| Evaluation type | What it mainly measures | What it misses |
|---|---|---|
| CTF / isolated task | Exploit skill, tool use, local reasoning | Persistence, planning, campaign state |
| Multi-step cyber range | Autonomy, pivoting, memory, attack chaining | Some real-world noise, defenders, production messiness |
| Real engagement | Operational usefulness under pressure | Hard to standardize or publish safely |
Project Glasswing signals that advanced AI cyber systems are no longer being treated as speculative lab curiosities. They are being positioned as controlled, high-leverage defensive tools for securing major shared infrastructure, which is a much bigger institutional shift than any single benchmark result [3].
The supporting source from Analytics Vidhya is not Tier 1, so I'm treating it carefully. But it usefully summarizes the public story around Anthropic's initiative: major companies are participating, Anthropic is committing usage credits, and Mythos Preview is being framed as restricted because of its capability and risk profile [3].
Even if you strip away the hype, the structure matters. When a frontier model is not broadly released, but instead routed into a limited-access defensive program, that tells you the lab believes two things at once. First, the capability is real enough to be useful. Second, the misuse risk is real enough to justify tight control.
That combination is exactly what you'd expect when a system starts crossing from benchmark novelty into operational relevance.
I'd put it this way: Glasswing is less a product launch than a governance pattern. It says, "we think these systems can find and exploit serious flaws, but we want the deployment boundary to be institutions, not the public internet."
AI agents are clearly closer to meaningful autonomous cyber offense than they were a year ago, but the public evidence still shows important limits. They get farther, faster, with more compute and better scaffolding, yet they still break on novelty, long dependency chains, and active-defense realism [1][2].
The AI Security Institute paper is especially sober here. Their ranges had no active defenders blocking progress, and detections did not impede the agent [1]. That caveat matters. A system that reaches 22 of 32 steps in a static range is not the same thing as a system that can survive a live blue team, changing network conditions, and deliberate deception.
The Excalibur paper reaches a similar conclusion from another angle. Better tooling, memory, and difficulty-aware planning improve performance a lot, but they do not erase the hardest barriers: novel exploitation, adversarial environments, and very long campaigns [2].
That's why some community reactions are useful as a supplement, not a foundation. One practical argument from the Magonia piece is that defenders are not doomed just because exploit discovery accelerates; detection, hardening, and behavior-based controls do not map one-to-one to specific exploits [4]. I think that's right. The threat is real, but the "AI finds more vulns, therefore defense collapses" story is too simplistic.
The right interpretation is urgency without panic. Teams should assume frontier models are becoming materially better at long-form cyber workflows and start designing for that reality, especially around attack-surface reduction, segmentation, logging, and rapid patch pipelines [1][2][3].
If you build software, the practical shift is simple: stop thinking only in terms of "can AI write an exploit?" Start thinking, "can AI chain seven boring steps I assumed only a human would bother with?" That's the more realistic threat model.
If you work with AI systems directly, there's also a prompt-engineering lesson hiding here. The systems improving fastest are not just "bigger models." They are models plus scaffolding, memory management, and better task decomposition [2]. That same pattern shows up in everyday product work too. Clean task framing matters. Tools like Rephrase are useful precisely because they make intent clearer before a model starts reasoning. In lower-stakes workflows, that improves outputs. In cyber, that kind of structure can change capabilities.
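Here's a minimal sketch of that scaffolding pattern, with a stubbed model call so nothing here implies a specific API. The subtask list and summary budget are illustrative assumptions, not anyone's production setup.

```python
# Minimal sketch of the scaffolding pattern described above: decompose a
# task up front, then carry a bounded running summary instead of the full
# transcript. `call_model` is a stub; swap in whatever client you use.

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a placeholder result.
    return f"<result for: {prompt[:40]}...>"

def run_with_scaffolding(task: str, subtasks: list[str],
                         summary_budget: int = 1200) -> str:
    summary = f"Goal: {task}"
    for sub in subtasks:
        # Each step sees the goal plus compact memory, not raw history.
        result = call_model(f"{summary}\nNext step: {sub}")
        summary += f"\nDone: {sub} -> {result}"
        if len(summary) > summary_budget:
            # Compress memory before it crowds out the actual task.
            summary = call_model(f"Summarize progress, keep key facts:\n{summary}")
    return summary

print(run_with_scaffolding(
    "audit auth flow",
    ["map endpoints", "check token handling", "test session expiry"],
))
```

The interesting design choice is the compression step: it trades recall for coherence, which is exactly the trade long-horizon agents are learning to manage.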
For more applied AI workflow breakdowns, the Rephrase blog is worth browsing if you want the prompting side of this story rather than the policy side.
What works well, in my view, is to treat Project Glasswing as a warning shot about trajectory, not as proof of total autonomy. Mythos didn't need to "finish all 32" for the message to land. The message is that long-horizon cyber capability is moving from partial novelty to strategic concern.
And once that happens, the argument is no longer about whether agentic AI matters in security. It's about who gets to use it first, under what controls, and how fast everyone else catches up.
**What is "The Last Ones" cyber range?** It is a simulated corporate network attack scenario used to measure how far AI agents can autonomously progress through a long, multi-stage intrusion. The range requires chaining reconnaissance, credential theft, lateral movement, and data exfiltration across 32 sequential steps.

**Why do multi-step cyber ranges matter more than CTF scores?** Because long cyber ranges test planning, persistence, memory, and error recovery rather than isolated exploit skill. That makes them much closer to the real bottlenecks in autonomous offensive security.