
Gemma 4 is here: Google just dropped its most capable open model

Google released Gemma 4 under Apache 2.0 with four model sizes, multimodal support, and edge deployment. Here's what developers need to know.

April 2, 2026 · 13 min

So Google actually did it. They took their best open model family, made it genuinely competitive with the top tier, and slapped an Apache 2.0 license on it. No MAU caps. No weird acceptable use clauses. No "open but actually read the fine print" situation.

Gemma 4 dropped today, April 2, 2026. I've been poking at the preview builds for the past week, and I wanted to get my thoughts down while they're fresh. If you're building anything with open models right now, or you've been meaning to start, keep reading. There's a lot to unpack.

The four model sizes

Gemma 4 comes in four flavors. Google wasn't subtle about the naming:

E2B and E4B are the edge models. "E" stands for "Effective," so E2B activates about 2 billion parameters during inference even though the actual weights are bigger. Same idea with E4B but with 4 billion active params. The trick is that you get decent performance while keeping RAM and battery in check. Google says up to 60% less battery than previous Gemma models. I haven't verified that claim myself, but the E2B definitely feels snappy on my test phone.

E2B is roughly 3x faster than E4B. If you care about latency on mobile, go with E2B. If you need the model to actually think harder, E4B.

Both support 128K context. Both handle text, images (variable aspect ratios, no preprocessing needed), and here's the interesting part: native audio input. Only these two models get audio. If you're building anything voice-related on device, that's relevant.

The 26B model uses Mixture of Experts. 26 billion total parameters, but only 4 billion activate per forward pass. It's ranked #6 on the Arena AI text leaderboard right now, which is wild for a model that fits in ~18GB at 4-bit quantization. A single consumer GPU handles it.

The 31B dense model is the big one. Every parameter fires on every token. Currently #3 on Arena AI's text leaderboard, behind GLM-5 and Kimi K2.5. You need about 20GB at 4-bit or 34GB at 8-bit. An RTX 4090 or a 32GB M-series Mac runs it fine.

The 26B and 31B both get 256K context windows. Enough to fit a decent-sized codebase in one shot.

| Model | 4-bit RAM | 8-bit RAM | 16-bit RAM | Context |
|---|---|---|---|---|
| E2B | ~5 GB | ~8 GB | ~15 GB | 128K |
| E4B | ~5 GB | ~8 GB | ~15 GB | 128K |
| 26B MoE | ~18 GB | ~28 GB | -- | 256K |
| 31B Dense | ~20 GB | ~34 GB | -- | 256K |
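The RAM figures follow roughly from parameter count times bytes per weight, plus runtime overhead. Here's a back-of-the-envelope sketch; the 1.25x overhead factor for KV cache and runtime buffers is my own guess, not an official number:

```python
def approx_ram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.25) -> float:
    """Rough memory estimate: weights = params * bits/8, padded for
    KV cache and runtime buffers (the overhead factor is an assumption)."""
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb * overhead, 1)

# 31B dense at 4-bit: ~19.4 GB, close to the ~20 GB in the table above
print(approx_ram_gb(31, 4))
```

Handy when a new quantization drops and you want to know whether it fits your machine before downloading 20GB of weights.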

The licensing thing is the real story

I keep coming back to this. Previous Gemma releases had Google's custom license. It looked open, sure. But if you actually read it, there were acceptable use restrictions, and the terms could change. Legal teams at real companies flagged this stuff constantly.

Gemma 4 is Apache 2.0. That's it. Nothing else to parse.

You can modify the weights, redistribute them, build commercial products on top of them, and Google cannot retroactively change the deal. This is the same license that governs Kubernetes, TensorFlow, and a huge chunk of the open-source ecosystem. Everyone understands it. No lawyer needs to spend three days parsing edge cases.

Compare that to Llama 4. Meta's license has a 700 million Monthly Active User cap. Exceed it and you need a separate agreement. There are also EU-specific restrictions, which is a headache if you're deploying in Europe. (And I am, since I'm based in Poland, so this matters to me personally.)

My buddy Tomek works backend at a fintech in Warsaw. He'd been running Gemma 3 for internal document classification. When they tried to turn it into a customer-facing product, the licensing ambiguity killed the project for two months. Legal got involved, things stalled, they ended up switching models entirely. He pinged me this morning: "Finally, I can just use Gemma." That's the whole point. Apache 2.0 removes that kind of friction.

If you're picking an open model for a commercial product, the Apache 2.0 license text is short enough to read over coffee. Do it.

Where the benchmarks land

I don't think benchmarks tell you everything. But they tell you something, and here's what they're saying about Gemma 4.

The 31B model hits 89.2% on competition-level math. That's the single biggest benchmark improvement in this generation of open models, and it beats some models 20x its size on specific math tasks. If your use case involves numerical reasoning or formal logic, this is one of the best open options out there.

On Arena AI (which is basically a crowdsourced vibe check for model quality): the 31B is #3, the 26B is #6. The models above them are GLM-5 and Kimi K2.5, both Chinese-developed, both strong, both with their own ecosystem quirks.

Where it falls behind? Coding and general chat. Qwen 3.5 is still better on LiveCodeBench and SWE-bench. Llama 4 Maverick has the highest MMLU at 85.5%. GLM-5 leads HumanEval at 94.2% for raw code generation.

So no, Gemma 4 doesn't win everything. But math and reasoning are very strong, it holds its own elsewhere, and you can run it on hardware you probably already have.

Multimodal across the board

Every Gemma 4 model takes images as input. Variable aspect ratios and resolutions, so you don't need to crop or resize anything before feeding it in. The model deals with it.

The E2B and E4B models also accept audio. On-device speech recognition, voice commands, audio classification, all running locally with no cloud call. For the edge models, video is supported too. The documentation is thin on video processing specifics, but the model card confirms it.

Think about what this means if you're building mobile apps. An app that sees, hears, and reads text, running offline on the user's phone, in 5GB of RAM. A year ago you couldn't do that with open models. Now you can.
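As a concrete sketch, here's what a vision request to a local Ollama server could look like. The model tag `gemma4:e4b` and the exact payload shape are assumptions based on how Ollama handles images today (base64 strings in an `images` array):

```python
import base64
import json

def vision_payload(prompt: str, image_bytes: bytes, model: str = "gemma4:e4b") -> str:
    # Ollama-style request body: images are base64-encoded, and per the
    # model card there's no need to crop or resize before sending.
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode()],
        "stream": False,
    })

body = json.loads(vision_payload("What's in this photo?", b"<raw image bytes>"))
print(body["model"])
```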

How to actually run it

Fastest way? Ollama.

```shell
ollama run gemma4
```

That's it. It pulls a default model (usually the 26B quantized) and you're chatting. For the edge variants:

```shell
ollama run gemma4:e4b
ollama run gemma4:e2b
```
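If you'd rather script it than chat, Ollama also exposes a local REST API on port 11434. A minimal stdlib-only sketch, assuming the `gemma4` tag exists on your machine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "gemma4") -> bytes:
    # "stream": False returns one JSON object instead of a token stream
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "gemma4") -> str:
    # Requires a running Ollama server with the model pulled
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```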

If you prefer a GUI, LM Studio works. Search "gemma4," pick your quantization, download, run.

For more control, llama.cpp with GGUF files from Hugging Face. Best if you need to fine-tune inference settings or plug into a C++ pipeline.

Fine-tuning? Unsloth has Gemma 4 workflows documented already.

GPU stuff: NVIDIA released optimized builds for RTX cards. If you have a 3090, 4080, or 4090, you get a 3-10x speedup over CPU inference. AMD GPUs and TPUs are supported too.

One thing Android developers should know: Google baked Gemma 4 into Android Studio for local coding assistance. The AI Edge Gallery handles deployment to phones. If you're already in the Android ecosystem, the integration path is basically done.

What hardware do you actually need?

This is the question I keep getting, so let me just lay it out. The answer depends on which model you want and how fast you need it.

The thing most people miss: for LLM inference, memory bandwidth matters more than raw compute. An M3 Max with 48GB (400 GB/s bandwidth) will generate tokens faster than an M4 Pro with 24GB (273 GB/s), even though the M4 Pro is the newer chip. The model weights need to move from memory to compute on every single token, so bandwidth is the bottleneck.
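You can sanity-check this with simple arithmetic. Each generated token streams the model's weights through memory once, so bandwidth divided by weights read per token gives a first-order decode estimate; it's a rough bound, and real throughput shifts with kernel efficiency, caching, and quantization:

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    # First-order estimate: each token streams the weights from memory once.
    # Actual speeds vary with kernel efficiency and quantized compute.
    return bandwidth_gb_s / weights_read_gb

# Why the M3 Max (400 GB/s) beats the newer M4 Pro (273 GB/s) on the same model:
# the ratio of the two estimates is ~1.47x in the M3 Max's favor.
print(round(bandwidth_bound_tok_s(400, 20) / bandwidth_bound_tok_s(273, 20), 2))
```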

Here's my rough guide based on what I've tested and what community benchmarks show:

| Hardware | Unified memory | Best Gemma 4 model | Expected speed | Notes |
|---|---|---|---|---|
| MacBook Air M2 / M3, 16GB | 16 GB | E4B (4-bit) | ~25-35 tok/s | Runs smooth, stays quiet. The E2B is even faster if you don't need E4B's reasoning. |
| MacBook Air M3 / M4, 24GB | 24 GB | E4B or 26B MoE (4-bit) | ~15-25 tok/s on 26B | The 26B fits but the context window is limited. E4B is the safer bet here. |
| MacBook Pro M3 Pro / M4 Pro, 36GB | 36 GB | 26B MoE (4-bit) | ~18-28 tok/s | Comfortable fit. Enough room for the 26B plus a decent context window. |
| MacBook Pro M3 Max / M4 Max, 48GB | 48 GB | 31B Dense (4-bit) | ~25-35 tok/s | The sweet spot for the flagship model. The Max chips have 400-546 GB/s bandwidth, which makes a real difference. |
| MacBook Pro M4 Max / M3 Ultra, 64-128GB | 64-128 GB | 31B Dense (8-bit) | ~30-45 tok/s | You can skip aggressive quantization and run at higher precision. Quality goes up noticeably. |
| Mac Studio M3 Ultra / M4 Ultra, 192GB | 192 GB | 31B Dense (16-bit) or multiple models | ~35-50+ tok/s | Overkill for a single model. Great if you want to run multiple models simultaneously or load full precision. |

For the PC side:

| GPU | VRAM | Best Gemma 4 model | Expected speed | Notes |
|---|---|---|---|---|
| RTX 4060 Ti 16GB | 16 GB | E4B or 26B MoE (heavily quantized) | ~20-30 tok/s on E4B | The 26B is a tight squeeze. Stick to E4B for comfortable use. |
| RTX 4070 Ti Super | 16 GB | E4B, 26B MoE (4-bit, tight) | ~25-35 tok/s on E4B | Same VRAM ceiling as the 4060 Ti but faster compute. |
| RTX 4080 / 4080 Super | 16 GB | E4B, 26B MoE (4-bit) | ~30-40 tok/s on E4B | Still 16GB VRAM. NVIDIA's optimizations help squeeze more out. |
| RTX 4090 | 24 GB | 26B MoE or 31B Dense (4-bit) | ~40-55 tok/s | The 31B fits at 4-bit (20GB). Fast enough for real-time use. Probably the best single-GPU option for Gemma 4's full lineup. |
| RTX 5090 | 32 GB | 31B Dense (8-bit) | ~50-70 tok/s | Enough VRAM for higher precision; the 31B at 8-bit (~34 GB) needs some offloading tricks but runs well. |
| 2x RTX 4090 (tensor parallel) | 48 GB | 31B Dense (8-bit) | ~60-80 tok/s | If you already have two 4090s, tensor parallelism via llama.cpp gives you headroom for 8-bit and big context. |

Some observations from my testing:

The 26B MoE model is the sleeper hit for most people. It only activates 4 billion parameters per forward pass, which means it runs almost as fast as a small model while delivering quality close to the 31B. If you're on a 16GB machine (which is a lot of MacBooks and most consumer GPUs), the 26B at 4-bit quantization is probably your best option.
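The same bandwidth math explains the MoE speed advantage: per token you only read the routed experts' weights, so the cost scales with active parameters, not total. This is a simplification (attention layers and the router complicate it), but it shows why 26B total can decode like a 4B model:

```python
def active_weights_gb(active_params_b: float, bits: int) -> float:
    # GB actually read from memory per token (simplified MoE view:
    # only the active experts' weights count toward the bandwidth cost)
    return active_params_b * bits / 8

# 26B MoE with 4B active params at 4-bit reads ~2 GB per token,
# versus ~15.5 GB per token for the 31B dense at the same precision.
print(active_weights_gb(4, 4), active_weights_gb(31, 4))
```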

The edge models (E2B, E4B) are genuinely usable on phones. Google demoed them on Pixel devices and the Jetson Orin Nano. If you have a Raspberry Pi 5 with 8GB, the E2B runs on it. Not fast, but it runs.

For Apple users specifically: if you're buying a Mac and you know you want to run local models, spend the money on RAM, not GPU cores. Going from 24GB to 48GB unified memory matters more than going from M4 Pro to M4 Max at the same RAM. The bandwidth jump from Pro to Max is significant (273 vs 546 GB/s on M4), but having the model fit in memory at all is the first thing that matters.

Agent workflows, not just Q&A

This is the part that got me excited when I was testing the preview.

Gemma 4 has native function calling, structured JSON output, and system instructions baked into the training. When you define tools for the model, it picks the right one, formats the arguments correctly, and handles the response. This isn't some prompt hack. It's trained behavior.

Why this matters: if you're building agents that chain API calls, even a 5% failure rate on JSON formatting will ruin your day. The model hallucinates a field name, your parser throws, the whole chain dies.
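To make the failure mode concrete, here's the kind of strict dispatcher an agent loop runs. The tool name and arguments are made up for illustration, but the three parsing steps are exactly where a few percent of malformed outputs kills a chain:

```python
import json

# Hypothetical tool registry; a real agent maps names to actual API calls
TOOLS = {"get_invoice_total": lambda invoice_id: f"total for {invoice_id}"}

def dispatch(model_output: str) -> str:
    call = json.loads(model_output)   # dies if the model wraps JSON in markdown
    tool = TOOLS[call["tool"]]        # dies on a hallucinated tool name
    return tool(**call["args"])       # dies on invented or missing parameters

print(dispatch('{"tool": "get_invoice_total", "args": {"invoice_id": "INV-7"}}'))
```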

A friend of mine, Ania, does ML engineering at a Krakow startup. She had a data pipeline agent running on Llama 3 and couldn't stop the model from wrapping JSON in markdown blocks, dropping fields, or making up parameter names. She's been on Gemma 4's 26B since the preview and says her agent failure rate went from something like 15% down to under 2%. One data point, sure. But my own testing lines up. The structured output works more consistently than anything else I've tried at this size.

Picking between Gemma 4, Llama 4, and Qwen 3.5

I'll keep this simple because I've seen too many 3000-word comparison posts that don't actually help anyone decide.

Go with Gemma 4 if licensing matters (Apache 2.0, zero restrictions), you need edge or mobile deployment, your work is math-heavy or reasoning-focused, or you want one model family that covers everything from phones to workstations.

Go with Llama 4 if you need absurd context length (Scout does 10M tokens), you're building general-purpose chat, and the 700M MAU cap plus EU restrictions don't affect you.

Go with Qwen 3.5 if coding is your primary use case, you need serious multilingual support (201 languages), or you want the widest range of MoE model sizes from 0.8B all the way to 397B.

Personally, if I'm starting a new commercial project today, I'd go with Gemma 4. Mostly because of the license. The performance is a nice bonus but the licensing clarity is what actually unblocks you.

What this says about the open model space

Google putting Gemma 4 under Apache 2.0 is a statement. It puts pressure on Meta, Alibaba, and everyone else shipping models with custom licenses full of asterisks. Developers are tired of reading terms of service to figure out if they can actually ship a product.

The gap between open and proprietary models keeps closing too. Top open models are within 5-10 points of the best proprietary APIs on most benchmarks now. Add in privacy, latency, offline capability, and cost savings from local inference, and open models look better every few months.

Gemma 4 won't win every benchmark. It's not the best coding model or the best chat model. But it's competitive, it's properly open, and it runs on hardware that a lot of developers already own.

If you've been thinking about running models locally — whether for RAG pipelines, agent architectures, or just experimenting — this is a solid place to start.


Pull Gemma 4 right now with ollama run gemma4 and try it on your own stuff. Models are on Hugging Face, Kaggle, Ollama, and Google AI Studio.

Pawel Owerczuk

AI Agent & RAG Developer

AI Agent & RAG Developer with 10+ years of software engineering experience. Specialized in intelligent AI solutions for enterprises in the DACH & Nordic region.