Learn how to run Gemma 4 31B locally with the right hardware, quantization choices, and Ollama setup tips. See practical examples inside.
Running a 31B model locally sounds simple until you hit the part where your machine runs out of memory, your tokens crawl, and every quantized build claims to be the "best." That's the real challenge with Gemma 4 31B.
Gemma 4 31B can run locally on consumer hardware, but "run" and "run well" are different things. In practice, the sweet spot is a 32GB-class GPU or a high-memory Apple Silicon machine, especially if you want useful context lengths and not just a demo prompt. [1][2]
Here's the thing I noticed when looking through the sources: the model size is only half the story. The other half is memory growth from the KV cache, which expands with context length. The Open-TQ-Metal paper shows why this becomes the real limiter during long-context inference. On Gemma 4 31B, even when weights are compressed, cache memory and bandwidth still decide whether the experience feels usable or miserable [1].
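That cache growth is easy to put rough numbers on. The sketch below is a back-of-envelope estimate only; the layer, head, and dimension values are hypothetical placeholders for a 31B-class model, not Gemma 4's actual config, and real backends add their own overhead:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Estimate KV cache size: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 31B-class dimensions (illustrative, NOT Gemma 4's real config):
layers, kv_heads, head_dim = 60, 8, 128

fp16_8k = kv_cache_bytes(layers, kv_heads, head_dim, 8_192, 2) / 2**30
q4_128k = kv_cache_bytes(layers, kv_heads, head_dim, 131_072, 0.5) / 2**30

print(f"8K context, FP16 cache:    ~{fp16_8k:.1f} GiB")
print(f"128K context, 4-bit cache: ~{q4_128k:.1f} GiB")
```

The point of the exercise: cache size scales linearly with context length, so even a heavily compressed cache at long context can dwarf the 8K FP16 cache. That's why weight quantization alone doesn't settle the memory question.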
A practical way to think about hardware is this:
| Setup | What to expect with Gemma 4 31B |
|---|---|
| 16GB VRAM GPU | Usually not enough for a comfortable 31B setup |
| 24GB VRAM GPU | Possible with 4-bit weights and shorter context |
| 32GB VRAM GPU | The realistic target for strong local use |
| 64GB unified memory Mac | Viable, especially with aggressive cache optimization |
| CPU-only | Technically possible, rarely enjoyable |
A Reddit benchmark showed Gemma 4 31B fitting at full 256K context on a single RTX 5090 with 32GB VRAM, but only by combining a Q4 weight build with compressed KV cache tricks [3]. That's impressive, but it's not the baseline setup most Ollama users should assume.
My take: if you want the model for real daily use, plan around 32GB VRAM or 64GB+ unified/system memory. Anything lower means more tradeoffs.
Quantization matters because it turns Gemma 4 31B from a model that barely fits into memory into one you can actually use. Without compression, both weight storage and long-context KV cache growth become too expensive for most local machines. [1]
There are two separate compression problems here. First, you compress the model weights. That's where 4-bit GGUF-style builds come in. Second, you deal with the KV cache during inference. The paper on Open-TQ-Metal makes this distinction very clear: even if the weights fit, long contexts can still break your setup because cache memory keeps growing with sequence length [1].
What's especially interesting is that not every quantization method behaves equally well on Gemma 4. The paper found that simple per-group int4 quantization stayed robust on Gemma 4 31B, while some angular KV compression approaches degraded badly because of Gemma 4's attention scaling behavior [1]. That's a big reason I'd avoid getting too clever too early.
For most local users, the sane path is still a mainstream Q4-family build at a moderate context window, scaled up only after you've checked output quality on your own prompts.
This is also why tools like Rephrase are handy when you're testing local models. If the model is slower than cloud APIs, you want each prompt to be sharper so you waste fewer turns getting to a good answer.
The best quantized build is usually the one that balances memory, speed, and backend compatibility, not the one with the most aggressive compression. For Gemma 4 31B, a solid Q4-family build is the safest default if you want Ollama or llama.cpp-style local inference. [1][3]
A lot of people fixate on squeezing the model into the smallest footprint possible. I get the appeal. But the sources suggest that Gemma 4 is sensitive enough that quality and implementation details matter. The Open-TQ-Metal results show int4 remaining reliable on Gemma 4 31B, while some more exotic approaches failed outright at scale [1].
That gives us a simple comparison:
| Quantization choice | Upside | Tradeoff |
|---|---|---|
| BF16 / FP16 | Best quality | Huge memory cost |
| Q8 | Strong quality, easier fit | Still heavy |
| Q4 | Best practical balance | Some quality loss |
| Ultra-low-bit experimental | Smallest footprint | Higher risk of instability or degraded output |
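To make the table concrete, here's a rough per-precision footprint estimate. The formula is a naive `params × bits / 8` and deliberately ignores quantization metadata (per-group scales and zeros, mixed-precision layers), which is why real Q4_K-style builds land somewhat above the naive number:

```python
def weight_gib(n_params, bits_per_weight):
    # Naive weight storage estimate: params * bits / 8 bytes, converted to GiB.
    # Ignores per-group scales/zeros and embedding/output-head precision.
    return n_params * bits_per_weight / 8 / 2**30

params = 31e9  # 31B parameters
for name, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name:>4}: ~{weight_gib(params, bits):.1f} GiB")
```

Running this gives roughly 58 GiB for BF16, 29 GiB for Q8, and 14 GiB for Q4, which is why Q4 is the first build that fits comfortably inside a 24-32GB VRAM budget.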
Community reports line up with that. One RTX 5090 user ran a Gemma 4 31B Q4_K-style build at 17.46 GiB and reported 61.5 tok/s generation with compressed cache at very long context [3]. That's great, but it also depended on backend fixes and custom branches. In plain English: use mainstream quants first.
If you're comparing lots of builds, it helps to keep a simple test prompt and rewrite it consistently. That's another place where a fast prompt optimizer like Rephrase can save time across apps, especially if you're benchmarking in terminal, browser, and notes at once.
The easiest Ollama setup is to install Ollama, pull a Gemma 4-compatible model build, run it with a modest context window first, and only then scale up. The catch is that Ollama inherits backend limitations, so model quality depends on more than a single command. [2][3]
Here's the clean setup flow I'd recommend. A typical workflow looks like this:

```shell
ollama pull gemma4:31b
ollama run gemma4:31b
```
If your chosen build uses a custom tag, use that instead. The exact model name may vary depending on what Ollama exposes in its library at the time.
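Beyond the CLI, Ollama exposes a local REST API (by default on port 11434), which is handy for scripted testing. A minimal sketch, assuming the `gemma4:31b` tag from above; the `num_ctx` option caps the context window, which is how you keep the KV cache bounded while you're still validating the setup:

```python
def build_generate_request(model, prompt, num_ctx=8192):
    """Build the JSON body for Ollama's /api/generate endpoint.

    num_ctx limits the context window, which bounds KV cache memory;
    start modest and raise it only once the setup is proven stable.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }

payload = build_generate_request("gemma4:31b", "Summarize this README in 3 bullets.")

# To actually send it (requires a running Ollama server on the default port):
# import json, urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```

Keeping `num_ctx` explicit also makes benchmark runs reproducible: you know exactly what context budget each speed number was measured at.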
Here's a simple before and after for testing local setup quality:
| Before | After |
|---|---|
| "Explain this code" | "Explain this Python function step by step. Identify the input, output, edge cases, and any performance issues. Keep the answer concise." |
That kind of prompt cleanup matters more on local models because every extra turn costs time. If you want more prompting examples, the Rephrase blog has a lot of useful patterns you can reuse.
One important warning: early Gemma 4 community posts showed that backend bugs could make the model look worse than it really was. Some users reported broken outputs until llama.cpp-side fixes landed for chat parsing, token handling, and Gemma 4-specific behavior [3]. So if the model feels "bad," don't assume it's the weights alone.
The biggest mistakes are chasing maximum context too early, picking the smallest quant without testing quality, and blaming the model when the backend is the real issue. Gemma 4 31B is powerful, but local inference is still a systems problem as much as a model problem. [1][3]
I'd keep three realities in mind. First, published max context is not the same as practical context on your machine. Second, cache compression and weight quantization solve different problems. Third, new model releases often need a few rounds of backend fixes before they feel stable.
That last point matters a lot. The community benchmark on 5090 hardware worked because it used a tuned build, explicit flags, and a backend patched for Gemma 4 support [3]. Most users in Ollama want something simpler, so expectations should be simpler too.
If you want to run Gemma 4 31B locally, don't start by asking whether it fits. Start by asking whether it fits comfortably at your target context length. That's the question that saves you hours.
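The "fits comfortably" question reduces to a quick sanity check: quantized weights plus KV cache plus headroom, measured against your VRAM. The numbers below are illustrative estimates for a Q4-quantized 31B-class model with a compressed cache at long context, not measured values:

```python
def fits_comfortably(vram_gib, weights_gib, cache_gib, overhead_gib=2.0):
    # Leave headroom for activations, runtime buffers, and fragmentation.
    return weights_gib + cache_gib + overhead_gib <= vram_gib

# Illustrative: ~15.5 GiB Q4 weights, ~8 GiB compressed cache at long context.
print(fits_comfortably(24, 15.5, 8.0))  # → False (tight on a 24GB card)
print(fits_comfortably(32, 15.5, 8.0))  # → True  (comfortable at 32GB)
```

Shrink `cache_gib` by lowering the context window and the same 24GB card may pass the check, which is exactly the tradeoff the hardware table above describes.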
And once the model is running, the next bottleneck is usually prompt quality, not installation. Better prompts make local models feel faster, sharper, and less frustrating. That's exactly why lightweight tools like Rephrase exist.
Documentation & Research
1. Open-TQ-Metal paper (link)

Community Examples
2. Gemma 4 has been released - r/LocalLLaMA (link)
3. Gemma 4 31B at 256K Full Context on a Single RTX 5090 - TurboQuant KV Cache Benchmark - r/LocalLLaMA (link)
Can Gemma 4 31B run on consumer hardware?
Yes, but usually only with aggressive weight quantization and a reduced context window. For comfortable local use, 32GB VRAM or strong unified memory helps a lot.
Does Ollama support Gemma 4 31B?
It can, but support depends on the quality of the model build and the backend fixes available at the time. Early Gemma 4 releases showed that backend maturity matters almost as much as raw hardware.