Discover how Opus 4.7, DeepSeek V4, and Qwen 3.6 Plus handle 1M-token recall, and where long-context memory still fails.
The marketing line is easy now: every frontier model has "1M context." The harder question is whether that context is actually usable.
A 1M context window tells you how much text a model can ingest, but not how well it can preserve, retrieve, and reason over that text. The best recent evidence shows that models can ace simple buried-fact tests while still breaking on multi-hop chains inside the same million-token prompt [1][2].
That distinction matters. In a May 2026 paper on long-context retrieval over classical Chinese text, Claude Opus 4.7 scored 100% on single-needle retrieval at 1M tokens, alongside Gemini 3.1 Pro and GPT-5.5 [1]. But the more revealing test was multi-hop chain retrieval at 256K, 512K, and 1M tokens. That's where the cracks showed.
The paper splits model behavior into three patterns: stable, late-cliff, and smooth-decline [1]. Claude Opus 4.7 sits in the stable group. Qwen 3.6 Plus falls into the late-cliff group. DeepSeek V4 Pro declines more steadily across tiers. That's the comparison developers actually care about, because real work rarely looks like "find one sentence and repeat it back."
Claude Opus 4.7 performed best of these three on the cited long-context recall tests, especially once the task required chaining facts rather than locating a single planted answer. DeepSeek V4 was middling and decayed gradually, while Qwen 3.6 Plus looked much weaker at 1M when the task punished memorization and forced true in-context reasoning [1].
Here's the most useful table from the paper's results, simplified for these three models:
| Model | Single-needle @ 1M | Multi-hop @ 256K | Multi-hop @ 512K | Multi-hop @ 1M | Decay pattern |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 18/18 | 4/5 | 5/5 | 3/5 | Stable |
| DeepSeek V4 Pro | 13/18 | 4/5 | 2/5 | 0/5 | Smooth decline |
| Qwen 3.6 Plus | 7/18 | 5/5 | 4/5 | 0/5 | Late cliff |
What's interesting is that Qwen 3.6 Plus looks decent until it suddenly doesn't. At 512K, it still manages 4/5 on the chain task. At 1M, it drops to 0/5 [1]. That's the kind of failure that can burn a product team, because internal testing at "large but not huge" contexts may look fine right before production traffic hits the wall.
DeepSeek V4 Pro is different. It doesn't fall off a cliff; it just gets steadily less reliable as the context expands. In some ways that's easier to work with, because the failure mode is more predictable.
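To make the three decay patterns concrete, here is a minimal Python sketch that labels each model's multi-hop curve from the table. The heuristic and both thresholds (`floor`, `cliff_share`) are my own assumptions for illustration; the paper's actual classification criteria are not reproduced here.

```python
from typing import Sequence

# Multi-hop scores (correct answers out of 5) at 256K, 512K, and 1M,
# copied from the table above.
SCORES = {
    "Claude Opus 4.7": (4, 5, 3),
    "DeepSeek V4 Pro": (4, 2, 0),
    "Qwen 3.6 Plus": (5, 4, 0),
}


def classify_decay(correct: Sequence[int], total: int = 5,
                   floor: float = 0.5, cliff_share: float = 0.7) -> str:
    """Label a failure curve with a hypothetical heuristic (not the paper's).

    floor: minimum accuracy at the largest tier to count as "stable".
    cliff_share: share of the total loss that must land in the final step
        for the curve to count as a "late cliff".
    """
    acc = [c / total for c in correct]
    if acc[-1] >= floor:
        return "stable"
    total_loss = max(acc) - acc[-1]
    last_drop = acc[-2] - acc[-1]
    if total_loss > 0 and last_drop / total_loss >= cliff_share:
        return "late cliff"
    # Everything else, including curves that lose ground at every tier.
    return "smooth decline"


for model, scores in SCORES.items():
    print(f"{model}: {classify_decay(scores)}")
```

With these assumed thresholds the labels match the table's last column, but that agreement is by construction. With only five questions per tier, a single answer moves accuracy by 20 points, so treat any such label as a rough signal.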
Long-context recall is harder because the model must both locate information and preserve the dependency chain needed to answer correctly. Research on long-context benchmarks keeps showing that models can retrieve isolated facts more easily than they can maintain reasoning fidelity across long spans [1][2].
This is the core lesson. "Needle in a haystack" is useful, but it flatters models. The same May 2026 study found that Qwen 3.6 Plus often answered real historical facts correctly by leaning on prior training, then failed when the planted fact contradicted that memory [1]. In other words, it sometimes looked smart for the wrong reason.
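That's why I'd trust the altered-needle and multi-hop tests more than the headline window size. They separate three behaviors: locating a planted fact, trusting that planted fact when it contradicts training memory, and chaining several facts into one answer. If a model fails on the second or third, your million-token workflow is fragile.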
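If you want to run this kind of check on your own documents, a rough Python sketch looks like the following. Everything here is hypothetical: the helper names, the synthetic facts, and the substring-based grading are my own simplifications, not the paper's harness.

```python
import random


def build_haystack(filler: list[str], needles: list[str], seed: int = 0) -> str:
    """Scatter needles through filler paragraphs, keeping the needles in order."""
    rng = random.Random(seed)
    # Pick distinct insertion slots, sorted so hop 1 lands before hop 2, etc.
    slots = sorted(rng.sample(range(len(filler) + 1), k=len(needles)))
    out = list(filler)
    # Insert from the back so earlier slot indices stay valid.
    for slot, needle in reversed(list(zip(slots, needles))):
        out.insert(slot, needle)
    return "\n\n".join(out)


def grade(answer: str, planted: str, memorized: str) -> str:
    """Separate in-context recall from answers drawn from prior training."""
    text = answer.lower()
    if planted.lower() in text:
        return "used_context"
    if memorized.lower() in text:
        return "used_memory"
    return "missed"


# A three-hop chain built from synthetic facts the model cannot have memorized.
chain = [
    "The courier's code name is Heron.",
    "Heron reports to the office in Tallis.",
    "The Tallis office keeps its records in vault 7.",
]
filler = [f"Filler paragraph {i}." for i in range(1000)]
haystack = build_haystack(filler, chain)
question = "Which vault holds the records for the courier's office?"

# An altered needle: the planted value contradicts the well-known one.
print(grade("It was completed in 1889.", planted="1921", memorized="1889"))
```

The last line prints `used_memory`, which is the failure described above: the answer is historically right but ignores what the context actually says.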
This lines up with the broader long-context literature too. The LongSeeker paper argues that context growth itself becomes the enemy in long-horizon tasks, and that agents need active context shaping rather than naive accumulation [2]. That's a different setup from a raw recall benchmark, but the message is similar: more context is not automatically more usable context.
DeepSeek V4's biggest win is efficiency, not pure recall quality. Its architecture is explicitly built to reduce FLOPs and KV-cache growth at extreme sequence lengths, which makes million-token operation more practical than older long-context designs [3].
That matters more than people admit. A "1M context" feature is useless if latency and memory make it impossible to run. The DeepSeek V4 architecture uses hybrid attention, compression, and lower KV-cache overhead to make long contexts cheaper to serve [3]. According to the Hugging Face write-up based on the technical report, MRCR 8-needle retrieval stays above 0.82 through 256K and lands at 0.59 at 1M for V4-Pro-Max [3].
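To see why KV-cache growth is the bottleneck, here is a back-of-the-envelope Python calculation for standard, uncompressed attention. The layer count, KV-head count, and head dimension are hypothetical placeholders, not DeepSeek V4's real configuration, and this formula does not model V4's compression.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence under standard (uncompressed) attention.

    The leading 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical dimensions, NOT DeepSeek V4's real configuration.
config = dict(layers=60, kv_heads=8, head_dim=128)

for tokens in (256_000, 512_000, 1_000_000):
    gb = kv_cache_bytes(tokens, **config) / 1e9
    print(f"{tokens:>9,} tokens -> {gb:.1f} GB of KV cache")
```

Under these assumed dimensions, a single 1M-token sequence needs roughly 246 GB of cache before any weights are loaded. The cost grows linearly with sequence length, which is why compression and hybrid attention matter more at 1M than at 128K.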
That's respectable. But recall the difference between retrieval curves and reasoning curves. In the external paper, DeepSeek V4 Pro dropped from 4/5 at 256K to 2/5 at 512K to 0/5 at 1M on three-hop chain recall [1]. So my take is simple: DeepSeek V4 looks like a strong systems-oriented model for large contexts, but not the winner here on long-context reasoning fidelity.
A practical Reddit test on large codebases says something similar. One user found DeepSeek V4 useful up to roughly 150K-250K tokens for coding, with precision degrading past 300K and implementation detail fading by 520K [4]. That's anecdotal, not foundational, but it matches the broader pattern.
For 1M-context tasks, you should prompt as if the model will forget structure before it forgets raw text. That means explicitly naming the retrieval target, enforcing intermediate checks, and separating evidence gathering from synthesis.
Here's a basic before-and-after I'd use for long-context recall work.
| Before | After |
|---|---|
| "Read this repo and tell me where the auth bug is." | "Read the attached repo. First, identify the authentication flow entry points. Second, trace token validation across files. Third, list the exact files and functions involved. Only then propose the likely bug and cite the evidence you used." |
And for document reasoning:
> You are answering from the provided context only.
>
> Task:
> 1. Find the relevant passages.
> 2. Quote or summarize the exact evidence.
> 3. If the answer requires linking multiple passages, state the chain explicitly.
> 4. If the evidence is missing or contradictory, say so.
> 5. Do not rely on outside knowledge unless I ask.
>
> Question: [your question]
Here's what I noticed: this kind of scaffolding matters more as context grows. It won't magically turn Qwen's 1M cliff into stability, but it reduces ambiguity and makes failures easier to detect.
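One way to enforce the split between evidence gathering and synthesis is to make it two separate calls. The Python sketch below assumes a `call_model` function you supply yourself (any callable that takes a prompt string and returns text); the prompt wording, the `NO_EVIDENCE` sentinel, and the function name are my own illustrative choices, not a vendor API.

```python
from typing import Callable, Optional

EVIDENCE_PROMPT = """You are answering from the provided context only.

Task:
1. Find the passages relevant to the question.
2. Quote the exact evidence, one passage per line.
3. If no passage is relevant, reply with exactly NO_EVIDENCE.

Question: {question}

Context:
{context}"""

SYNTHESIS_PROMPT = """Answer the question using only the quoted evidence.
If the answer requires linking multiple passages, state the chain explicitly.
If the evidence is missing or contradictory, say so.

Question: {question}

Evidence:
{evidence}"""


def answer_in_two_passes(question: str, context: str,
                         call_model: Callable[[str], str]) -> dict:
    """Gather evidence first, then synthesize from the evidence alone."""
    evidence = call_model(
        EVIDENCE_PROMPT.format(question=question, context=context)
    ).strip()
    answer: Optional[str] = None
    if "NO_EVIDENCE" not in evidence:
        # The second call never sees the full context, so any claim in the
        # answer can be traced back to a quoted passage.
        answer = call_model(
            SYNTHESIS_PROMPT.format(question=question, evidence=evidence)
        )
    return {"evidence": evidence, "answer": answer}
```

The design choice here is that the second call is cheap and auditable: if the quoted evidence is wrong or incomplete, you can see it before trusting the final answer.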
This is also where tools like Rephrase are useful. If you're constantly moving between ChatGPT, Claude, your IDE, and docs, tightening a loose "read all this and help" prompt into a structured long-context instruction can save a surprising amount of trial and error. If you want more patterns like this, the Rephrase blog is full of prompt breakdowns for real workflows.
You should trust 1M context for some workloads, but not as a blanket replacement for retrieval. The current evidence says long-context models can handle simple lookups well, while multi-hop reasoning and reliability under extreme prompt length still vary a lot by model [1][2].
My rule of thumb is blunt. If the task is mostly "search this large blob and summarize," long context is great. If the task is "trace dependencies, keep entity bindings straight, and cite exact evidence across distant sections," you still want retrieval discipline, chunking strategy, or a memory layer.
That's especially true when teams benchmark only on happy-path prompts. The 512K-to-1M jump is where weak long-context behavior gets exposed [1]. Don't ship based on the label. Ship based on the failure curve.
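Here is a minimal Python sketch of what "ship based on the failure curve" could look like as a release check. The `ask_model` callable, the tier keys, the substring-based scoring, and the 0.8 accuracy bar are all assumptions for illustration; swap in your own client and grading.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (full prompt, expected answer substring)


def failure_curve(cases_by_tier: Dict[int, List[Case]],
                  ask_model: Callable[[str], str]) -> Dict[int, float]:
    """Measure accuracy separately at each context-size tier."""
    curve: Dict[int, float] = {}
    for tier in sorted(cases_by_tier):
        cases = cases_by_tier[tier]
        if not cases:
            raise ValueError(f"no test cases for tier {tier}")
        hits = sum(
            expected.lower() in ask_model(prompt).lower()
            for prompt, expected in cases
        )
        curve[tier] = hits / len(cases)
    return curve


def release_gate(curve: Dict[int, float], max_tier_in_prod: int,
                 min_accuracy: float = 0.8) -> bool:
    """Pass only if every tier up to the production ceiling meets the bar."""
    return all(
        acc >= min_accuracy
        for tier, acc in curve.items()
        if tier <= max_tier_in_prod
    )
```

The point of keying on `max_tier_in_prod` is that a model can pass at 512K and still be blocked from a 1M deployment, which is exactly the gap a late cliff hides.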
And yes, prompt optimization helps. Even a lightweight tool like Rephrase can be handy when you need to turn messy cross-app instructions into cleaner evidence-first prompts before dropping a giant context into Claude or DeepSeek.
The 1M context era is here. The catch is that usable recall is still uneven. Right now, Opus 4.7 looks like the safest bet of these three for long-context recall, DeepSeek V4 looks more efficient than faithful at the edge, and Qwen 3.6 Plus looks fine right up until it very much doesn't.
Documentation & Research
Community Examples
A 1M context window means a model can accept roughly one million input tokens in a single prompt. The important catch is that accepted context and usable context are not the same thing.
Based on the cited May 2026 evaluation, Claude Opus 4.7 is one of the strongest on both single-needle retrieval and multi-hop recall at 1M. DeepSeek V4 and Qwen 3.6 Plus show steeper degradation.