Why 1M Context Still Breaks

Discover how Opus 4.7, DeepSeek V4, and Qwen 3.6 Plus handle 1M-token recall, and where long-context memory still fails. See examples inside.

Ilia Ilinskii
Rephrase · May 23, 2026

Tools · 8 min read

The marketing line is easy now: every frontier model has "1M context." The harder question is whether that context is actually usable.

Key Takeaways

1M tokens is a capacity number, not a reliability number.
Single-fact retrieval is getting solved faster than multi-hop recall.
Claude Opus 4.7 looks stronger than DeepSeek V4 and Qwen 3.6 Plus on long-context reasoning, not just raw retrieval.
The real breakpoint is often the jump from 512K to 1M, not 32K to 128K.
Prompt structure still matters, because long context does not remove the need for retrieval discipline.

What does "1M context" actually tell you?

A 1M context window tells you how much text a model can ingest, but not how well it can preserve, retrieve, and reason over that text. The best recent evidence shows that models can ace simple buried-fact tests while still breaking on multi-hop chains inside the same million-token prompt [1][2].

That distinction matters. In the May 2026 paper on long-context retrieval over classical Chinese text, Claude Opus 4.7 scored 100% on single-needle retrieval at 1M tokens, alongside Gemini 3.1 Pro and GPT-5.5 [1]. But the more revealing test was multi-hop chain retrieval at 256K, 512K, and 1M tokens. That's where the cracks showed.

The paper splits model behavior into three patterns: stable, late-cliff, and smooth-decline [1]. Claude Opus 4.7 sits in the stable group. Qwen 3.6 Plus falls into the late-cliff group. DeepSeek V4 Pro falls into the smooth-decline group, losing accuracy steadily at each tier. That's the comparison developers actually care about, because real work rarely looks like "find one sentence and repeat it back."

How did Opus 4.7, DeepSeek V4, and Qwen 3.6 Plus compare?

Claude Opus 4.7 performed best of these three on the cited long-context recall tests, especially once the task required chaining facts rather than locating a single planted answer. DeepSeek V4 was middling and decayed gradually, while Qwen 3.6 Plus looked much weaker at 1M when the task punished memorization and forced true in-context reasoning [1].

Here's the most useful table from the paper's results, simplified for these three models:

Model	Single-needle @ 1M	Multi-hop @ 256K	Multi-hop @ 512K	Multi-hop @ 1M	Decay pattern
Claude Opus 4.7	18/18	4/5	5/5	3/5	Stable
DeepSeek V4 Pro	13/18	4/5	2/5	0/5	Smooth decline
Qwen 3.6 Plus	7/18	5/5	4/5	0/5	Late cliff

What's interesting is that Qwen 3.6 Plus looks decent until it suddenly doesn't. At 512K, it still manages 4/5 on the chain task. At 1M, it drops to 0/5 [1]. That's the kind of failure that can burn a product team, because internal testing at "large but not huge" contexts may look fine right before production traffic hits the wall.

DeepSeek V4 is different. It doesn't fall off a cliff. It just gets steadily less reliable as the context expands. In some ways, that's easier to work with, because at least the failure mode is more predictable.
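To make the three decay patterns concrete, here's a minimal Python sketch that labels a pass-rate curve using the multi-hop numbers from the table above. The thresholds (`floor`, `healthy`) and the function name `classify_decay` are my own assumptions for illustration. The paper does not publish a rule like this, so treat it as one plausible way to bucket your own results, not as the authors' method.

```python
from typing import Sequence

# Multi-hop scores from the table above, as fractions of 5 trials
# at the 256K, 512K, and 1M tiers.
SCORES = {
    "Claude Opus 4.7": [4 / 5, 5 / 5, 3 / 5],
    "DeepSeek V4 Pro": [4 / 5, 2 / 5, 0 / 5],
    "Qwen 3.6 Plus": [5 / 5, 4 / 5, 0 / 5],
}


def classify_decay(scores: Sequence[float],
                   floor: float = 0.5,
                   healthy: float = 0.75) -> str:
    """Label a pass-rate curve ordered from smallest to largest context.

    `floor` and `healthy` are illustrative thresholds, not the paper's.
    """
    if len(scores) < 2:
        raise ValueError("need at least two context tiers")
    last, second_last = scores[-1], scores[-2]
    if last >= floor:
        # Still usable at the largest tier.
        return "stable"
    if second_last >= healthy:
        # Looked fine one tier earlier, then collapsed.
        return "late cliff"
    # Was already degrading before the largest tier.
    return "smooth decline"


for model, curve in SCORES.items():
    print(f"{model}: {classify_decay(curve)}")
```

With these thresholds the labels match the table's last column. The point of writing the rule down is that you can run it on your own eval results and see which bucket your model lands in before production traffic tells you.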

Why is long-context recall harder than simple retrieval?

Long-context recall is harder because the model must both locate information and preserve the dependency chain needed to answer correctly. Research on long-context benchmarks keeps showing that models can retrieve isolated facts more easily than they can maintain reasoning fidelity across long spans [1][2].

This is the core lesson. "Needle in a haystack" is useful, but it flatters models. The same May 2026 study found that Qwen 3.6 Plus often answered real historical facts correctly by leaning on prior training, then failed when the planted fact contradicted that memory [1]. In other words, it sometimes looked smart for the wrong reason.

That's why I'd trust the altered-needle and multi-hop tests more than the headline window size. They separate three behaviors:

Can the model notice buried information?
Can it prefer prompt evidence over memorized priors?
Can it follow a chain across distant parts of the prompt?

If a model fails on step two or three, your million-token workflow is fragile.
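If you want to probe those three behaviors yourself, a small harness is enough to get started. The sketch below is a rough Python outline, not the paper's benchmark: the `Probe` class, the planted sentences, the depths, and the substring grader are all invented for illustration. It only builds the prompts, so you would plug in whatever model client you already use.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    name: str
    needles: dict[float, str]  # relative depth (0.0 to 1.0) -> planted sentence
    question: str
    expected: str  # substring a correct answer must contain


def build_haystack(filler: list[str], needles: dict[float, str]) -> str:
    """Plant each needle between filler paragraphs at its relative depth."""
    paragraphs = list(filler)
    # Insert deepest first so shallower insert positions are not shifted.
    for depth, needle in sorted(needles.items(), reverse=True):
        if not 0.0 <= depth <= 1.0:
            raise ValueError(f"depth must be in [0, 1], got {depth}")
        paragraphs.insert(round(depth * len(filler)), needle)
    return "\n\n".join(paragraphs)


def grade(probe: Probe, answer: str) -> bool:
    """Crude substring check; a real harness would need stricter grading."""
    return probe.expected.lower() in answer.lower()


# All planted facts below are invented test fixtures.
PROBES = [
    # 1. Can the model notice buried information?
    Probe("single_needle",
          {0.5: "The archive key for shelf nine is QUARTZ-17."},
          "What is the archive key for shelf nine?", "QUARTZ-17"),
    # 2. Can it prefer prompt evidence over memorized priors?
    Probe("altered_needle",
          {0.5: "In this document, the Eiffel Tower stands in Lyon."},
          "According to the document, which city is the Eiffel Tower in?",
          "Lyon"),
    # 3. Can it follow a chain across distant parts of the prompt?
    Probe("multi_hop",
          {0.1: "The courier with code HERON reports to the Jade office.",
           0.5: "The Jade office is managed by Inspector Valmont.",
           0.9: "Inspector Valmont locks the building at 16:40."},
          "At what time is the office that HERON reports to locked?",
          "16:40"),
]

filler = [f"Filler paragraph {i} with no relevant facts." for i in range(1000)]
for probe in PROBES:
    prompt = build_haystack(filler, probe.needles) + "\n\nQuestion: " + probe.question
    print(probe.name, len(prompt))
```

Swap the filler for real documents sized to each context tier you care about, then compare pass rates per probe type. If the altered-needle probe fails while the single-needle one passes, the model is probably answering from memory.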

This lines up with broader long-context literature too. The LongSeeker paper argues that context growth itself becomes the enemy in long-horizon tasks, and that agents need active context shaping rather than naive accumulation [2]. That's a different setup from a raw recall benchmark, but the message is similar: more context is not automatically more usable context.

What does DeepSeek V4 get right?

DeepSeek V4's biggest win is efficiency, not pure recall quality. Its architecture is explicitly built to reduce FLOPs and KV-cache growth at extreme sequence lengths, which makes million-token operation more practical than older long-context designs [3].

That matters more than people admit. A "1M context" feature is useless if latency and memory make it impossible to run. The DeepSeek V4 architecture uses hybrid attention, compression, and lower KV-cache overhead to make long contexts cheaper to serve [3]. According to the Hugging Face write-up based on the technical report, MRCR 8-needle retrieval stays above 0.82 through 256K and lands at 0.59 at 1M for V4-Pro-Max [3].
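To see why KV-cache growth is the thing to attack, here's some back-of-the-envelope arithmetic in Python. It uses the standard cache size for a plain transformer decoder with no compression: two tensors (keys and values) per layer, each `n_kv_heads * head_dim` values per token. The model dimensions are hypothetical round numbers I picked for illustration. They are not DeepSeek V4's configuration, and the sketch does not model its hybrid attention or compression.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence in an uncompressed decoder.

    The leading 2 accounts for storing both keys and values per layer.
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical dimensions, chosen only to show the scaling.
HYPOTHETICAL = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for tokens in (128_000, 256_000, 1_000_000):
    gb = kv_cache_bytes(tokens, **HYPOTHETICAL) / 1e9
    print(f"{tokens:>9,} tokens -> {gb:6.1f} GB")
```

Under these assumptions a single 1M-token sequence needs roughly 246 GB of cache, and the cost grows linearly with length. That is the overhead an efficiency-focused design is trying to shrink.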

That's respectable. But remember the difference between retrieval curves and reasoning curves. In the external paper, DeepSeek V4 Pro dropped from 4/5 at 256K to 2/5 at 512K to 0/5 at 1M on three-hop chain recall [1]. So my take is simple: DeepSeek V4 looks like a strong systems-oriented model for large contexts, but not the winner here on long-context reasoning fidelity.

A practical Reddit test on large codebases says something similar. One user found DeepSeek V4 useful up to roughly 150K-250K tokens for coding, with precision degrading past 300K and implementation detail fading by 520K [4]. That's anecdotal, not foundational, but it matches the broader pattern.

How should you prompt for 1M-context tasks?

For 1M-context tasks, you should prompt as if the model will forget structure before it forgets raw text. That means explicitly naming the retrieval target, enforcing intermediate checks, and separating evidence gathering from synthesis.

Here's a basic before-and-after I'd use for long-context recall work.

Before	After
"Read this repo and tell me where the auth bug is."	"Read the attached repo. First, identify the authentication flow entry points. Second, trace token validation across files. Third, list the exact files and functions involved. Only then propose the likely bug and cite the evidence you used."

And for document reasoning:

You are answering from the provided context only.

Task:
1. Find the relevant passages.
2. Quote or summarize the exact evidence.
3. If the answer requires linking multiple passages, state the chain explicitly.
4. If the evidence is missing or contradictory, say so.
5. Do not rely on outside knowledge unless I ask.

Question: [your question]

Here's what I noticed: this kind of scaffolding matters more as context grows. It won't magically turn Qwen's 1M cliff into stability, but it reduces ambiguity and makes failures easier to detect.
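One way to act on "easier to detect" is to check the model's quoted evidence against the context automatically. The Python sketch below wraps the template above and flags quotes that do not appear in the source text. The function names, the `Context:` section, the placement of the question after the context, and the straight-double-quote convention are my assumptions, not something any model requires.

```python
import re

# The template above, with an assumed Context section added.
TEMPLATE = """You are answering from the provided context only.

Task:
1. Find the relevant passages.
2. Quote or summarize the exact evidence.
3. If the answer requires linking multiple passages, state the chain explicitly.
4. If the evidence is missing or contradictory, say so.
5. Do not rely on outside knowledge unless I ask.

Context:
{context}

Question: {question}"""


def build_prompt(context: str, question: str) -> str:
    """Fill the evidence-first template with a context and a question."""
    return TEMPLATE.format(context=context, question=question)


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not break matching."""
    return " ".join(text.split()).lower()


def unsupported_quotes(answer: str, context: str, min_len: int = 12) -> list[str]:
    """Return double-quoted spans in the answer that are absent from the context.

    Only straight double quotes are handled; short quotes are skipped
    because they match too easily to be meaningful evidence.
    """
    haystack = normalize(context)
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes
            if len(q) >= min_len and normalize(q) not in haystack]


# Invented example: the second quote misstates the closing time.
context = ("The HERON courier reports to the Jade office.\n"
           "The Jade office closes at 16:40.")
answer = ('Evidence: "The HERON courier reports to the Jade office." and '
          '"The Jade office closes at 18:00."')

print(unsupported_quotes(answer, context))
```

Here the check flags the second quote, because the context says 16:40. It only catches fabricated quotes, not wrong reasoning over real ones, so it complements manual review rather than replacing it.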

This is also where tools like Rephrase are useful. If you're constantly moving between ChatGPT, Claude, your IDE, and docs, tightening a loose "read all this and help" prompt into a structured long-context instruction can save a surprising amount of trial and error. If you want more patterns like this, the Rephrase blog is full of prompt breakdowns for real workflows.

Should you trust 1M context instead of RAG?

You should trust 1M context for some workloads, but not as a blanket replacement for retrieval. The current evidence says long-context models can handle simple lookups well, while multi-hop reasoning and reliability under extreme prompt length still vary a lot by model [1][2].

My rule of thumb is blunt. If the task is mostly "search this large blob and summarize," long context is great. If the task is "trace dependencies, keep entity bindings straight, and cite exact evidence across distant sections," you still want retrieval discipline, chunking strategy, or a memory layer.

That's especially true when teams benchmark only on happy-path prompts. The 512K-to-1M jump is where weak long-context behavior gets exposed [1]. Don't ship based on the label. Ship based on the failure curve.
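"Ship based on the failure curve" can be turned into a simple gate. This Python sketch takes per-tier pass rates and returns the largest context tier where that tier and every smaller one clear a bar you choose. The numbers are the multi-hop results from the comparison table earlier in this article. The 0.8 bar and the function name are my assumptions, and with only five trials per tier the rates are noisy, so you'd want far more samples in a real gate.

```python
from typing import Optional

TIERS = ["256K", "512K", "1M"]  # ordered from smallest to largest

# Multi-hop pass rates from the comparison table (5 trials per tier).
PASS_RATES = {
    "Claude Opus 4.7": {"256K": 4 / 5, "512K": 5 / 5, "1M": 3 / 5},
    "DeepSeek V4 Pro": {"256K": 4 / 5, "512K": 2 / 5, "1M": 0 / 5},
    "Qwen 3.6 Plus": {"256K": 5 / 5, "512K": 4 / 5, "1M": 0 / 5},
}


def largest_safe_tier(rates: dict[str, float], tiers: list[str],
                      min_pass_rate: float) -> Optional[str]:
    """Return the largest tier where it and all smaller tiers meet the bar.

    Returns None if even the smallest tier falls short.
    """
    safe = None
    for tier in tiers:
        if rates[tier] < min_pass_rate:
            break  # stop at the first tier that misses the bar
        safe = tier
    return safe


for model, rates in PASS_RATES.items():
    print(f"{model}: cap context at {largest_safe_tier(rates, TIERS, 0.8)}")
```

With a 0.8 bar, none of the three clears 1M on this task. That is the kind of result a context-window label will never show you.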

And yes, prompt optimization helps. Even a lightweight tool like Rephrase can be handy when you need to turn messy cross-app instructions into cleaner evidence-first prompts before dropping a giant context into Claude or DeepSeek.

The 1M context era is here. The catch is that usable recall is still uneven. Right now, Opus 4.7 looks like the safest bet of these three for long-context recall, DeepSeek V4 looks more efficient than faithful at the edge, and Qwen 3.6 Plus looks fine right up until it very much doesn't.

References

Documentation & Research

[1] Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text - arXiv cs.AI
[2] LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents - The Prompt Report
[3] DeepSeek-V4: a million-token context that agents can actually use - Hugging Face Blog

Community Examples

[4] Deepseek V4's 1M context window: the breaking point - r/LocalLLaMA

Frequently asked

What does a 1M context window actually mean?

A 1M context window means a model can accept roughly one million input tokens in a single prompt. The important catch is that accepted context and usable context are not the same thing.

Which model is best at 1M-token recall?

Based on the cited May 2026 evaluation, Claude Opus 4.7 is one of the strongest on both single-needle retrieval and multi-hop recall at 1M. DeepSeek V4 and Qwen 3.6 Plus show steeper degradation.