Discover why Gemini 3.1 Pro still leads Opus 4.7 on GPQA Diamond and what the 94.3% score really says about pure reasoning.
Benchmarks are noisy. Hype is louder than signal. But every so often, a single number still tells us something real.
Gemini 3.1 Pro still leads because GPQA Diamond rewards precise, high-difficulty scientific reasoning, and Google's latest model appears especially strong when questions demand multi-step inference under uncertainty. The lead is tiny, and a 0.1-point edge cannot settle much on its own, but on a benchmark this hard it is still a signal worth tracking rather than marketing fluff. [1][2][3]
The number that matters here is 94.3%. That is the GPQA Diamond score cited for Gemini 3.1 Pro in Google-linked reporting, while Claude Opus 4.7 comes in at 94.2% in Anthropic-linked reporting. [2][3] On paper, that difference looks almost trivial. In practice, I would not dismiss it.
Here's why. GPQA Diamond is not another broad trivia benchmark. It is meant to probe graduate-level scientific question answering. The ReThinker paper describes expert-level scientific reasoning as a core unsolved challenge for large language models, especially on benchmarks that resist shallow retrieval and pattern matching. [1] That framing matters because GPQA Diamond sits in that same family of "do you actually reason, or do you just look smart?" evaluations.
My take: Google is still best understood as slightly ahead on pure reasoning, while Anthropic is pushing harder on operational usefulness.
GPQA Diamond measures whether a model can answer difficult science questions that require domain knowledge plus careful reasoning, rather than easy recall. In other words, it is a benchmark for intellectual discipline, not just eloquence, and that is why small deltas on it attract so much attention. [1][2]
A lot of benchmark arguments go wrong because people compare unlike things. GPQA Diamond is not SWE-bench. It is not Terminal-Bench. It is not a browsing benchmark. It is also not a consumer preference poll.
The ReThinker paper is useful here because it explains what makes expert reasoning hard: multi-step inference, domain-specific knowledge, quantitative rigor, and resistance to superficial pattern matching. [1] That lens helps us interpret why Google's 94.3% matters. It suggests Gemini 3.1 Pro is not just optimized for broad assistant behavior. It is also strong at the stricter kind of reasoning that defines the frontier.
What I noticed is that Google's positioning around Gemini 3.1 Pro also leans into this story. The model is described as a "core reasoning" model with major gains on ARC-AGI-2, HLE, and GPQA Diamond. [2] That cluster of benchmarks paints a pretty coherent picture: Google wants to win the "thinking model" narrative.
Opus 4.7 is extremely close. A 94.2% GPQA Diamond score means Anthropic is not behind in any dramatic sense; it is effectively tied at the frontier, with Google only nominally ahead in the headline number. [3]
This is where nuance matters. If you only read the title, you might assume Gemini has opened a gap. It has not. A 0.1-point advantage is a lead, but it is a narrow one.
That said, small margins at the top are still worth tracking. Once models are clustered in the 90s on hard benchmarks, progress tends to become incremental. A tiny lead can reflect real architectural or post-training differences. It can also disappear in the next release.
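To put the 0.1-point gap in perspective, here is a rough back-of-the-envelope sketch. It assumes GPQA Diamond contains 198 questions (a figure not stated in this article) and treats each score as a single pass over independent questions. That is a simplification: reported scores are often averaged over several runs, and the two numbers here come from different sources. The function names are illustrative.

```python
import math

# Assumption: GPQA Diamond contains 198 questions (not stated in this article).
N_QUESTIONS = 198


def standard_error(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_in_questions(acc_a: float, acc_b: float, n: int = N_QUESTIONS) -> float:
    """Number of questions that the accuracy gap between two models corresponds to."""
    return (acc_a - acc_b) * n


if __name__ == "__main__":
    gemini, opus = 0.943, 0.942  # scores quoted in this article
    print(f"one question is worth {100 / N_QUESTIONS:.2f} points")
    print(f"standard error at 94.3%: {100 * standard_error(gemini):.2f} points")
    print(f"gap expressed in questions: {gap_in_questions(gemini, opus):.2f}")
```

Under these assumptions, one question is worth about 0.5 points and the standard error of a single pass is about 1.6 points, so a 0.1-point gap is smaller than one question. That is consistent with reading the two models as effectively tied on this benchmark while still noting who holds the headline number.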
A community benchmark summary on r/LocalLLaMA helps illustrate the broader picture. It shows Claude Opus 4.6 already trailing top reasoning leaders on GPQA Diamond while staying very competitive or stronger on several agentic and coding tasks. [4] That community snapshot is not a primary source, but it matches the practical feel many power users report: Claude's sweet spot often shows up in workflow-heavy use, even when raw reasoning charts are neck and neck.
GPQA Diamond does not decide the whole debate because model quality is multidimensional. A model can lead on scientific reasoning and still lose on coding, tool use, latency, cost efficiency, or long-running agent reliability depending on the task and the product around it. [1][2][3]
This is the part people skip because it is less fun than declaring a winner.
Gemini 3.1 Pro's public benchmark profile highlights reasoning gains, including GPQA Diamond and ARC-AGI-2. [2] Opus 4.7's profile, by contrast, emphasizes software engineering, terminal workflows, and long-horizon task execution, alongside a nearly identical GPQA score. [3]
That difference shows up clearly in practice:
| Model | GPQA Diamond | Positioning strength | Better fit for |
|---|---|---|---|
| Gemini 3.1 Pro | 94.3% | Pure reasoning, science-heavy inference, big-context synthesis | Research prompts, analytical comparisons, hard explanation tasks |
| Claude Opus 4.7 | 94.2% | Agentic coding, long-running tasks, strong tool-based workflows | Coding assistants, implementation planning, software review loops |
So if you are picking a model for "who reasons best in a vacuum," Gemini still gets the nod. If you are picking a model for shipping product work, the answer may be different.
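The table can also be read as a simple lookup. The sketch below encodes its "Better fit for" column as a routing table. The task category names and the `pick_model` helper are hypothetical, invented here for illustration, and the mapping only restates the positioning above rather than any measured result.

```python
from typing import Optional

# Hypothetical task categories, derived from the "Better fit for" column above.
BETTER_FIT = {
    "research_prompt": "Gemini 3.1 Pro",
    "analytical_comparison": "Gemini 3.1 Pro",
    "hard_explanation": "Gemini 3.1 Pro",
    "coding_assistant": "Claude Opus 4.7",
    "implementation_planning": "Claude Opus 4.7",
    "software_review_loop": "Claude Opus 4.7",
}


def pick_model(task: str) -> Optional[str]:
    """Return the table's suggested model, or None when the task is not covered."""
    return BETTER_FIT.get(task)


if __name__ == "__main__":
    for task in ("research_prompt", "coding_assistant", "translation"):
        print(task, "->", pick_model(task))
```

Returning `None` for uncovered tasks is deliberate: the table says nothing about them, so the sketch does not guess a default.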
You should prompt Gemini for tightly scoped analytical reasoning and prompt Opus for reasoning embedded inside a workflow. Gemini tends to reward clean problem framing, while Opus often shines when the task includes process, verification, or execution structure around the reasoning itself. [2][3]
Here's a simple before-and-after example.
Before:
Explain which AI model is better at reasoning.
After for Gemini 3.1 Pro:
Compare Gemini 3.1 Pro and Claude Opus 4.7 on pure reasoning only.
Use GPQA Diamond as the primary benchmark.
Then explain what GPQA Diamond measures, what conclusions are justified, and what conclusions would be overstated.
Keep the answer to 5 short sections.
After for Claude Opus 4.7:
Act as an AI evaluation analyst.
Compare Gemini 3.1 Pro and Claude Opus 4.7 using GPQA Diamond as the main reasoning benchmark, then add a separate section on practical workflow implications for coding and agent tasks.
State assumptions, identify benchmark limitations, and end with a recommendation by use case.
The difference is subtle but important. Gemini prompts do well when you sharpen the intellectual target. Opus prompts often improve when you define the evaluation procedure and expected decision format.
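If you reuse these rewrites often, the two "after" prompts can be captured as small templates. This is a minimal sketch that only builds strings and calls no model API; the function names and parameters are hypothetical.

```python
def reasoning_prompt(model_a: str, model_b: str, benchmark: str, sections: int = 5) -> str:
    """Tightly scoped analytical prompt, following the Gemini rewrite above."""
    return "\n".join([
        f"Compare {model_a} and {model_b} on pure reasoning only.",
        f"Use {benchmark} as the primary benchmark.",
        f"Then explain what {benchmark} measures, what conclusions are justified, "
        "and what conclusions would be overstated.",
        f"Keep the answer to {sections} short sections.",
    ])


def workflow_prompt(model_a: str, model_b: str, benchmark: str) -> str:
    """Prompt that wraps the reasoning in a procedure, following the Opus rewrite above."""
    return "\n".join([
        "Act as an AI evaluation analyst.",
        f"Compare {model_a} and {model_b} using {benchmark} as the main reasoning benchmark, "
        "then add a separate section on practical workflow implications "
        "for coding and agent tasks.",
        "State assumptions, identify benchmark limitations, "
        "and end with a recommendation by use case.",
    ])


if __name__ == "__main__":
    args = ("Gemini 3.1 Pro", "Claude Opus 4.7", "GPQA Diamond")
    print(reasoning_prompt(*args))
    print()
    print(workflow_prompt(*args))
```

Keeping the two templates separate mirrors the point above: one sharpens the intellectual target, the other spells out the evaluation procedure and the expected decision format.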
If you do this kind of model-specific prompting a lot, tools like Rephrase can speed up the rewrite step across apps. That's especially handy when you're jumping between a browser, IDE, and Slack thread and don't want to manually tune every prompt.
Builders should take away that Google still leads by a hair on pure reasoning, but the practical model choice should still be made task by task. Benchmark wins are useful signals, not universal verdicts, and the smartest teams treat them that way. [1][2][3]
Here's my blunt read.
If your product needs hard analytical output, scientific synthesis, or deep "figure this out from first principles" responses, Gemini 3.1 Pro deserves serious attention. If your product needs coding help, tool use, or long-running structured workflows, Opus 4.7 may still be the more attractive option even while losing the GPQA headline by 0.1 points.
That is the real prompt engineering lesson too. Don't ask, "Which model is best?" Ask, "Best at what, under what prompt, in what workflow?"
If you want more articles on model comparisons, prompting tactics, and practical rewrites, the Rephrase blog is a good place to keep digging. And if you are constantly rewriting vague prompts into benchmark-ready ones, Rephrase is one of the cleaner ways to do it without breaking flow.
Documentation & Research
Supporting Official/Industry Reporting
2. Google AI Releases Gemini 3.1 Pro with 1 Million Token Context and 77.1 Percent ARC-AGI-2 Reasoning for AI Agents - MarkTechPost
3. Anthropic Launches Claude Opus 4.7 For "Most Difficult Tasks" - Analytics Vidhya

Community Examples
4. Open vs Closed Source SOTA - Benchmark overview - r/LocalLLaMA
GPQA Diamond is a hard benchmark built to test graduate-level scientific reasoning. It is designed to reduce easy pattern-matching and reward models that can reason through expert questions.
GPQA Diamond does not settle which model is best overall. It is useful for measuring a specific kind of scientific reasoning, but it does not fully predict coding, agentic behavior, browsing, or product workflow performance.