Discover how Opus 4.7, DeepSeek V4, and Qwen 3.6 Plus handle 1M-token recall and multi-hop reasoning. See where each model breaks. Read on.
The 1M-token era is real. The catch is that "supports 1M context" and "can actually remember what matters at 1M" are not the same thing.
1M context recall means more than fitting a million tokens into the prompt. It means the model can still retrieve, connect, and reason over information buried deep in that prompt without drifting, hallucinating, or defaulting to memorized facts. That distinction is the whole story here.[1]
A recent paper evaluating five frontier 1M-context models on classical Chinese text makes this brutally clear. In simple single-needle retrieval, Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5 all hit 100%. But once the task shifted to multi-hop chain traversal across 256K, 512K, and 1M-token tiers, performance split into very different decay patterns.[1]
That matters because most real work is not "find one sentence." It is "find three related facts scattered across a repo, spec, Slack log, and meeting notes, then tell me what they imply."
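To make "multi-hop" concrete, here is a rough sketch of how a chained-needle test can be constructed: the answer requires hopping from one planted fact to the next, each buried at a different depth. The facts, names, and wording below are invented for illustration; this is not the paper's actual setup or data.

```python
# Illustrative multi-hop "needle chain" builder, not the paper's benchmark.
# Answering the question requires traversing fact 1 -> fact 2 -> fact 3.
import random

def build_chain_haystack(filler_paragraphs: list[str]) -> tuple[str, str]:
    needles = [
        "Project Heron's budget was approved by the Falcon committee.",
        "The Falcon committee is chaired by M. Ibarra.",
        "M. Ibarra reports directly to the CFO.",
    ]
    docs = filler_paragraphs[:]
    for needle in needles:
        # Bury each needle at a random depth in the haystack.
        docs.insert(random.randrange(len(docs) + 1), needle)
    question = ("Who does the person who chairs the committee that approved "
                "Project Heron's budget report to? Answer from the context only.")
    return "\n\n".join(docs), question
```

A single-needle test only needs the last hop; the chain version fails if the model loses any link along the way.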
Claude Opus 4.7 is the strongest all-around performer here because it combines perfect 1M single-needle retrieval with the best multi-hop durability of the three models in this comparison. It is not flawless, but it degrades more gracefully than Qwen 3.6 Plus and DeepSeek V4 Pro on the cited benchmark.[1]
Here is the cleanest comparison from the paper:
| Model | 1M single-needle retrieval | Multi-hop at 256K | Multi-hop at 512K | Multi-hop at 1M | Pattern |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 18/18 | 4/5 | 5/5 | 3/5 | Mostly stable |
| Qwen 3.6 Plus | 7/18 | 5/5 | 4/5 | 0/5 | Late cliff |
| DeepSeek V4 Pro | 13/18 | 4/5 | 2/5 | 0/5 | Smooth decline |
What I notice is that the headline "1M context" hides two different skills. Opus 4.7 is better at actually using the window. Qwen 3.6 Plus can look excellent until it suddenly stops being excellent. DeepSeek V4 is less dramatic, but it loses ground steadily.[1]
That makes Opus the safer choice for high-stakes document QA, repo-wide debugging, and multi-file planning where recall errors become expensive.
Qwen 3.6 Plus appears to hold up through 512K on multi-hop retrieval, then collapses at 1M, which suggests its practical recall ceiling is much lower than its advertised context ceiling. In the cited study, it went from 4/5 at 512K to 0/5 at 1M on the chain task.[1]
That is the classic "late cliff" pattern. The model looks healthy during testing if you never push it to the wall. Then you hand it a giant corpus and ask for chained reasoning, and it starts answering from prior knowledge or grabs the wrong entity.
The paper also found something uglier in the single-needle test: Qwen often answered with historically memorized facts instead of the altered facts actually planted in-context. In other words, sometimes it was not reading the prompt well enough. It was guessing from training memory.[1]
That is exactly the kind of failure developers miss because the answer can still sound smart.
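One cheap way to catch this failure mode in your own stack is to plant a deliberately altered fact and check which version comes back. A minimal sketch, assuming a generic `call_model` function standing in for whatever client you use; the planted fact, filler, and question are illustrative, not the benchmark's data.

```python
# Probe for "answering from training memory instead of from context".
# call_model() is a placeholder, not a real library API.

def memory_vs_context_probe(call_model, filler: str) -> bool:
    planted = "In this archive, the Treaty of Tordesillas was signed in 1512."  # deliberately wrong
    prompt = (
        "Answer only from the provided context.\n\n"
        f"{filler[:500_000]}\n{planted}\n{filler[500_000:]}\n\n"
        "Question: According to the context, in what year was the Treaty of "
        "Tordesillas signed? Reply with the year only."
    )
    answer = call_model(prompt)
    # 1494 (the real-world date) means the model answered from training memory;
    # 1512 means it actually read the planted sentence.
    return "1512" in answer
```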
If you must use a model with this kind of failure mode, vague prompts are a bad idea.
Before:
Read this whole archive and tell me who approved the pricing change and why.
After:
You must answer only from the provided context.
Task:
1. Find the earliest message that explicitly approves the pricing change.
2. Quote the approving sentence verbatim.
3. Identify the person who wrote it.
4. Find the nearest message that states the reason.
5. If any step is missing, say "not found in context."
Do not use prior knowledge. Do not infer names or reasons unless quoted or directly stated.
This will not magically fix weak long-context recall, but it reduces the model's freedom to improvise. That is also where tools like Rephrase help: they turn rough requests into sharper, constraint-heavy prompts without making you manually rewrite every query.
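If you build these prompts programmatically, the same constraints can be applied as a thin wrapper around the rough request. A minimal sketch in Python; the function name, file path, and exact wording are mine, not Rephrase's API.

```python
# Wrap a rough request in context-only constraints before sending it.
def constrain(request: str, context: str) -> str:
    return (
        "You must answer only from the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Task: {request}\n"
        "Rules:\n"
        "1. Quote supporting sentences verbatim.\n"
        "2. Do not use prior knowledge.\n"
        "3. Do not infer names, dates, or reasons unless directly stated.\n"
        "4. If the evidence is missing, reply exactly: not found in context.\n"
    )

prompt = constrain(
    "Who approved the pricing change, and why?",
    open("archive.txt").read(),   # hypothetical corpus file
)
```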
DeepSeek V4 matters because it attacks the infrastructure problem of long context, not just the benchmark problem. It makes 1M-token inference far cheaper in compute and KV cache terms, which is crucial for long-running agent workflows. That is a big deal even if recall is not best-in-class.[2]
The Hugging Face technical write-up highlights what DeepSeek changed: hybrid attention with Compressed Sparse Attention and Heavily Compressed Attention, major KV cache reduction, and a design aimed at keeping long agent traces usable over time.[2] The post reports that V4-Pro uses 27% of the single-token inference FLOPs of DeepSeek-V3.2 at 1M context and just 10% of the KV cache memory, while V4-Flash goes lower.[2]
That is not just an engineering flex. It changes what is deployable.
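To see why a 10x KV cache reduction changes what is deployable, here is a back-of-the-envelope sizing sketch. The layer count, KV heads, head dimension, and dtype are illustrative assumptions, not DeepSeek's actual configuration; only the reported ratios come from the cited write-up.[2]

```python
# Rough KV-cache sizing under assumed model dimensions (not DeepSeek's real config).
def kv_cache_gb(tokens: int, layers: int = 60, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    # 2x for storing both K and V at every layer for every cached token.
    values = tokens * layers * kv_heads * head_dim * 2
    return values * bytes_per_value / 1e9

dense = kv_cache_gb(1_000_000)                     # hypothetical dense baseline
print(f"dense-style cache at 1M tokens: {dense:.0f} GB")
print(f"at the reported 10% ratio:      {dense * 0.10:.0f} GB")
```

Under those assumptions the difference is hundreds of gigabytes versus tens, which is the difference between "needs a rack" and "fits on a node."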
Here is the practical split:
| Model | Best use case |
|---|---|
| Claude Opus 4.7 | Highest-confidence long-context reasoning and recall |
| DeepSeek V4 | Cost-sensitive long agent runs with heavy context growth |
| Qwen 3.6 Plus | Mid-range long-context tasks where you stay below the cliff |
DeepSeek's architecture also lines up with what newer agent research keeps showing: raw accumulation is a bad memory strategy. The LongSeeker paper argues that long-horizon agents need active context orchestration like compressing, deleting, rolling back, and preserving snippets instead of blindly appending everything forever.[3]
That is the part many teams still miss. Bigger windows delay the mess. They do not remove it.
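What orchestration looks like in practice is roughly a context buffer that can compress, delete, roll back, and preserve. A minimal sketch in that spirit; the class and method names are my own illustration, not LongSeeker's API or any specific agent framework.

```python
# Toy "active context orchestration" buffer: pin what must survive,
# compress or delete the rest, snapshot before risky steps.
from dataclasses import dataclass, field

@dataclass
class Snippet:
    text: str
    pinned: bool = False               # preserved snippets survive pruning

@dataclass
class AgentContext:
    budget_chars: int
    items: list[Snippet] = field(default_factory=list)
    snapshots: list[list[Snippet]] = field(default_factory=list)

    def append(self, text: str, pinned: bool = False) -> None:
        self.items.append(Snippet(text, pinned))
        while self._size() > self.budget_chars and self._shrink_once():
            pass                        # keep shrinking until under budget

    def snapshot(self) -> None:
        # Checkpoint before a risky step so the agent can roll back later.
        self.snapshots.append([Snippet(s.text, s.pinned) for s in self.items])

    def rollback(self) -> None:
        if self.snapshots:
            self.items = self.snapshots.pop()

    def _size(self) -> int:
        return sum(len(s.text) for s in self.items)

    def _shrink_once(self) -> bool:
        # Placeholder compression: truncate the oldest long, unpinned snippet.
        # A real agent would summarize it with a model call instead.
        for s in self.items:
            if not s.pinned and len(s.text) > 200:
                s.text = s.text[:180] + " …[compressed]"
                return True
        for i, s in enumerate(self.items):  # nothing left to compress: delete
            if not s.pinned:
                del self.items[i]
                return True
        return False                    # only pinned content remains
```

The specific data structure is not the point. The point is that something has to decide what stays in the window, or the window fills up with noise.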
In the 1M-context era, the best prompts behave less like chat and more like retrieval protocols. You need to tell the model what to extract, how to verify it, what not to assume, and how to respond when evidence is missing. That is how you defend against context rot.[1][3]
The workflow I recommend is exactly that: constrain what the model may use, force verbatim evidence, and make it say so when the evidence is not in the context.
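Here is a minimal verification pass for that workflow, assuming the model was told to quote evidence verbatim or reply "not found in context." It only checks that quoted evidence actually exists in the context; it does not judge whether the quote answers the question.

```python
# Reject answers whose quoted evidence does not appear verbatim in the context.
import re

def quotes_are_grounded(answer: str, context: str) -> bool:
    quotes = re.findall(r'"([^"]+)"', answer)       # pull out quoted spans
    if not quotes and "not found in context" not in answer.lower():
        return False                                # no evidence and no refusal
    return all(q in context for q in quotes)        # every quote must be verbatim
```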
A lot of developers still prompt 1M-context models like they are just bigger chatbots. They are not. They are more like messy working-memory systems. If you want better outputs, you need tighter instructions. For more articles on that, the Rephrase blog is worth browsing.
And if you work across IDEs, docs, Slack, and browser tabs all day, Rephrase is genuinely useful because this kind of prompt cleanup is repetitive. The faster move is often not "pick a bigger model." It is "write a stricter prompt."
Based on the cited evaluation, Claude Opus 4.7 is the strongest of these three on balanced long-context recall and multi-hop reasoning: it achieved perfect single-needle retrieval and held up better than Qwen 3.6 Plus and DeepSeek V4 on the harder chained retrieval tasks.
The broader lesson is that advertised capacity is not the same as usable recall. Attention efficiency, tokenization, retrieval behavior, and reasoning over distant facts all determine whether a model can actually use the full window.