Discover how Opus 4.7, DeepSeek V4, and Qwen 3.6 Plus handle 1M-token recall and multi-hop reasoning. See where each model breaks. Read on.
The 1M-token era is real. The catch is that "supports 1M context" and "can actually remember what matters at 1M" are not the same thing.
1M context recall means more than fitting a million tokens into the prompt. It means the model can still retrieve, connect, and reason over information buried deep in that prompt without drifting, hallucinating, or defaulting to memorized facts. That distinction is the whole story here.[1]
A recent paper evaluating five frontier 1M-context models on classical Chinese text makes this brutally clear. In simple single-needle retrieval, Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5 all hit 100%. But once the task shifted to multi-hop chain traversal across 256K, 512K, and 1M-token tiers, performance split into very different decay patterns.[1]
That matters because most real work is not "find one sentence." It is "find three related facts scattered across a repo, spec, Slack log, and meeting notes, then tell me what they imply."
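To make "multi-hop" concrete, here is a rough sketch of how a chained-needle test can be constructed: the answer requires hopping from one planted fact to the next, each buried at a different depth. The facts, names, and wording below are invented for illustration; this is not the paper's actual setup or data.

```python
# Illustrative multi-hop "needle chain" builder, not the paper's benchmark.
# Answering the question requires traversing fact 1 -> fact 2 -> fact 3.
import random

def build_chain_haystack(filler_paragraphs: list[str]) -> tuple[str, str]:
    needles = [
        "Project Heron's budget was approved by the Falcon committee.",
        "The Falcon committee is chaired by M. Ibarra.",
        "M. Ibarra reports directly to the CFO.",
    ]
    docs = filler_paragraphs[:]
    for needle in needles:
        # Bury each needle at a random depth in the haystack.
        docs.insert(random.randrange(len(docs) + 1), needle)
    question = ("Who does the person who chairs the committee that approved "
                "Project Heron's budget report to? Answer from the context only.")
    return "\n\n".join(docs), question
```

A single-needle test only needs the last hop; the chain version fails if the model loses any link along the way.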
Claude Opus 4.7 is the strongest all-around performer here because it combines perfect 1M single-needle retrieval with the best multi-hop durability of the three models in this comparison. It is not flawless, but it degrades more gracefully than Qwen 3.6 Plus and DeepSeek V4 Pro on the cited benchmark.[1]
Here is the cleanest comparison from the paper:
| Model | 1M single-needle retrieval | Multi-hop at 256K | Multi-hop at 512K | Multi-hop at 1M | Pattern |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 18/18 | 4/5 | 5/5 | 3/5 | Mostly stable |
| Qwen 3.6 Plus | 7/18 | 5/5 | 4/5 | 0/5 | Late cliff |
| DeepSeek V4 Pro | 13/18 | 4/5 | 2/5 | 0/5 | Smooth decline |
What I notice is that the headline "1M context" hides two different skills. Opus 4.7 is better at actually using the window. Qwen 3.6 Plus can look excellent until it suddenly stops being excellent. DeepSeek V4 is less dramatic, but it loses ground steadily.[1]
That makes Opus the safer choice for high-stakes document QA, repo-wide debugging, and multi-file planning where recall errors become expensive.
Qwen 3.6 Plus appears to hold up through 512K on multi-hop retrieval, then collapses at 1M, which suggests its practical recall ceiling is much lower than its advertised context ceiling. In the cited study, it went from 4/5 at 512K to 0/5 at 1M on the chain task.[1]
That is the classic "late cliff" pattern. The model looks healthy during testing if you never push it to the wall. Then you hand it a giant corpus and ask for chained reasoning, and it starts answering from prior knowledge or grabs the wrong entity.
The paper also found something uglier in the single-needle test: Qwen often answered with historically memorized facts instead of the altered facts actually planted in-context. In other words, sometimes it was not reading the prompt well enough. It was guessing from training memory.[1]
That is exactly the kind of failure developers miss because the answer can still sound smart.
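One cheap way to catch this failure mode in your own stack is to plant a deliberately altered fact and check which version comes back. A minimal sketch, assuming a generic `call_model` function standing in for whatever client you use; the planted fact, filler, and question are illustrative, not the benchmark's data.

```python
# Probe for "answering from training memory instead of from context".
# call_model() is a placeholder, not a real library API.

def memory_vs_context_probe(call_model, filler: str) -> bool:
    planted = "In this archive, the Treaty of Tordesillas was signed in 1512."  # deliberately wrong
    prompt = (
        "Answer only from the provided context.\n\n"
        f"{filler[:500_000]}\n{planted}\n{filler[500_000:]}\n\n"
        "Question: According to the context, in what year was the Treaty of "
        "Tordesillas signed? Reply with the year only."
    )
    answer = call_model(prompt)
    # 1494 (the real-world date) means the model answered from training memory;
    # 1512 means it actually read the planted sentence.
    return "1512" in answer
```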
If you must use a model with this kind of failure mode, vague prompts are a bad idea.
Before:
Read this whole archive and tell me who approved the pricing change and why.
After:
You must answer only from the provided context.
Task:
1. Find the earliest message that explicitly approves the pricing change.
2. Quote the approving sentence verbatim.
3. Identify the person who wrote it.
4. Find the nearest message that states the reason.
5. If any step is missing, say "not found in context."
Do not use prior knowledge. Do not infer names or reasons unless quoted or directly stated.
This will not magically fix weak long-context recall, but it reduces the model's freedom to improvise. That is also where tools like Rephrase help: they turn rough requests into sharper, constraint-heavy prompts without making you manually rewrite every query.
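If you build these prompts programmatically, the same constraints can be applied as a thin wrapper around the rough request. A minimal sketch in Python; the function name, file path, and exact wording are mine, not Rephrase's API.

```python
# Wrap a rough request in context-only constraints before sending it.
def constrain(request: str, context: str) -> str:
    return (
        "You must answer only from the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Task: {request}\n"
        "Rules:\n"
        "1. Quote supporting sentences verbatim.\n"
        "2. Do not use prior knowledge.\n"
        "3. Do not infer names, dates, or reasons unless directly stated.\n"
        "4. If the evidence is missing, reply exactly: not found in context.\n"
    )

prompt = constrain(
    "Who approved the pricing change, and why?",
    open("archive.txt").read(),   # hypothetical corpus file
)
```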
DeepSeek V4 matters because it attacks the infrastructure problem of long context, not just the benchmark problem. It makes 1M-token inference far cheaper in compute and KV cache terms, which is crucial for long-running agent workflows. That is a big deal even if recall is not best-in-class.[2]
The Hugging Face technical write-up highlights what DeepSeek changed: hybrid attention with Compressed Sparse Attention and Heavily Compressed Attention, major KV cache reduction, and a design aimed at keeping long agent traces usable over time.[2] The post reports that V4-Pro uses 27% of the single-token inference FLOPs of DeepSeek-V3.2 at 1M context and just 10% of the KV cache memory, while V4-Flash goes lower.[2]
That is not just an engineering flex. It changes what is deployable.
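To see why a 10x KV cache reduction changes what is deployable, here is a back-of-the-envelope sizing sketch. The layer count, KV heads, head dimension, and dtype are illustrative assumptions, not DeepSeek's actual configuration; only the reported ratios come from the cited write-up.[2]

```python
# Rough KV-cache sizing under assumed model dimensions (not DeepSeek's real config).
def kv_cache_gb(tokens: int, layers: int = 60, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    # 2x for storing both K and V at every layer for every cached token.
    values = tokens * layers * kv_heads * head_dim * 2
    return values * bytes_per_value / 1e9

dense = kv_cache_gb(1_000_000)                     # hypothetical dense baseline
print(f"dense-style cache at 1M tokens: {dense:.0f} GB")
print(f"at the reported 10% ratio:      {dense * 0.10:.0f} GB")
```

Under those assumptions the difference is hundreds of gigabytes versus tens, which is the difference between "needs a rack" and "fits on a node."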
Here is the practical split:
| Model | Best use case |
|---|---|
| Claude Opus 4.7 | Highest-confidence long-context reasoning and recall |
| DeepSeek V4 | Cost-sensitive long agent runs with heavy context growth |
| Qwen 3.6 Plus | Mid-range long-context tasks where you stay below the cliff |
DeepSeek's architecture also lines up with what newer agent research keeps showing: raw accumulation is a bad memory strategy. The LongSeeker paper argues that long-horizon agents need active context orchestration like compressing, deleting, rolling back, and preserving snippets instead of blindly appending everything forever.[3]
That is the part many teams still miss. Bigger windows delay the mess. They do not remove it.
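What orchestration looks like in practice is roughly a context buffer that can compress, delete, roll back, and preserve. A minimal sketch in that spirit; the class and method names are my own illustration, not LongSeeker's API or any specific agent framework.

```python
# Toy "active context orchestration" buffer: pin what must survive,
# compress or delete the rest, snapshot before risky steps.
from dataclasses import dataclass, field

@dataclass
class Snippet:
    text: str
    pinned: bool = False               # preserved snippets survive pruning

@dataclass
class AgentContext:
    budget_chars: int
    items: list[Snippet] = field(default_factory=list)
    snapshots: list[list[Snippet]] = field(default_factory=list)

    def append(self, text: str, pinned: bool = False) -> None:
        self.items.append(Snippet(text, pinned))
        while self._size() > self.budget_chars and self._shrink_once():
            pass                        # keep shrinking until under budget

    def snapshot(self) -> None:
        # Checkpoint before a risky step so the agent can roll back later.
        self.snapshots.append([Snippet(s.text, s.pinned) for s in self.items])

    def rollback(self) -> None:
        if self.snapshots:
            self.items = self.snapshots.pop()

    def _size(self) -> int:
        return sum(len(s.text) for s in self.items)

    def _shrink_once(self) -> bool:
        # Placeholder compression: truncate the oldest long, unpinned snippet.
        # A real agent would summarize it with a model call instead.
        for s in self.items:
            if not s.pinned and len(s.text) > 200:
                s.text = s.text[:180] + " …[compressed]"
                return True
        for i, s in enumerate(self.items):  # nothing left to compress: delete
            if not s.pinned:
                del self.items[i]
                return True
        return False                    # only pinned content remains
```

The specific data structure is not the point. The point is that something has to decide what stays in the window, or the window fills up with noise.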
In the 1M-context era, the best prompts behave less like chat and more like retrieval protocols. You need to tell the model what to extract, how to verify it, what not to assume, and how to respond when evidence is missing. That is how you defend against context rot.[1][3]
The workflow I recommend is exactly that: constrain what the model may use, force verbatim evidence, and make it say so when the evidence is not in the context.
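Here is a minimal verification pass for that workflow, assuming the model was told to quote evidence verbatim or reply "not found in context." It only checks that quoted evidence actually exists in the context; it does not judge whether the quote answers the question.

```python
# Reject answers whose quoted evidence does not appear verbatim in the context.
import re

def quotes_are_grounded(answer: str, context: str) -> bool:
    quotes = re.findall(r'"([^"]+)"', answer)       # pull out quoted spans
    if not quotes and "not found in context" not in answer.lower():
        return False                                # no evidence and no refusal
    return all(q in context for q in quotes)        # every quote must be verbatim
```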
A lot of developers still prompt 1M-context models like they are just bigger chatbots. They are not. They are more like messy working-memory systems. If you want better outputs, you need tighter instructions. For more articles on that, the Rephrase blog is worth browsing.
And if you work across IDEs, docs, Slack, and browser tabs all day, Rephrase is genuinely useful because this kind of prompt cleanup is repetitive. The faster move is often not "pick a bigger model." It is "write a stricter prompt."
Based on the cited evaluation, Claude Opus 4.7 is the strongest of these three on balanced long-context recall and multi-hop reasoning: it achieved perfect single-needle retrieval and held up better than Qwen 3.6 Plus and DeepSeek V4 on the harder chained retrieval tasks.
The broader lesson is that advertised capacity is not the same as usable recall. Attention efficiency, tokenization, retrieval behavior, and reasoning over distant facts all determine whether a model can actually use the full window.