The Three Models That Matter Right Now: GLM-5.1 vs Qwen3.6 Plus vs Gemma 4 31B

One is a 744B coding beast. One is free with a million-token window. One runs on your graphics card. Here’s who wins where — and why the answer depends on what you’re building.

The frontier model race has a new problem: it’s no longer obvious who’s winning.

For most of 2025, the leaderboard was straightforward. OpenAI and Anthropic traded blows at the top. Everyone else competed for third. But April 2026 has broken that pattern. Three models — each from a different continent, each built on a different philosophy — have arrived within weeks of each other, and they’re not neatly rankable.

GLM-5.1 (Z.AI, China) tops SWE-Bench Pro and claims the #10 spot on BenchLM. Qwen3.6 Plus (Alibaba, China) beats Claude on terminal coding benchmarks and gives you a million tokens of context for free. Gemma 4 31B (Google, US) is a 31-billion-parameter model that outperforms rivals twenty times its size — and you can run it on a single RTX 4090.

None of them dominates every category. All of them dominate at least one.

I spent time pulling data from Z.AI’s official blog, BenchLM, Google’s model card, and third-party evaluations to build a clear picture. Here’s what the numbers actually say — and what they mean for developers choosing a model today.

The Contenders at a Glance

Before the benchmarks, context matters. These three models were built for different things.

GLM-5.1 is Z.AI’s flagship, a 744B mixture-of-experts model (40B active parameters) with an MIT license. Released April 7, 2026. It was designed for long-horizon autonomous tasks — the kind where an agent works for hours, running its own tests, fixing its own bugs, iterating without human intervention. Its headline achievement: 8-hour sustained execution on a single task, including building a complete Linux desktop from scratch.

Qwen3.6 Plus is Alibaba’s latest, released April 2, 2026. It’s proprietary (no weights), but free during its preview period. Its selling point is breadth: a 1M token context window, native multimodal support, and competitive scores across almost every category. It’s the most general-purpose model in this comparison.

Gemma 4 31B is Google DeepMind’s answer to the question “what if we made something small that punches like something huge?” Released April 2, 2026 under Apache 2.0. It’s a dense 31B parameter model — no mixture-of-experts, no tricks — that somehow scores 89.2% on AIME 2026 competition math. You can download it, fine-tune it, and run it locally on consumer hardware.
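That "single RTX 4090" claim is easy to sanity-check with back-of-envelope arithmetic. A sketch: the 31B parameter count comes from the model card, the bit-widths are standard quantization levels, and KV-cache/activation overhead is ignored here:

```python
# Rough VRAM estimate for a dense 31B-parameter model at common precisions.
# Assumption: weights dominate memory; KV cache and activations add overhead
# that quantized local runtimes typically keep within a few GB.
PARAMS = 31e9

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_gb(bits):.1f} GB")
```

At 4-bit quantization the weights alone need about 15.5 GB, which leaves headroom on a 24 GB card; fp16 (roughly 62 GB) does not fit, which is why local runs depend on quantized builds.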

| | GLM-5.1 | Qwen3.6 Plus | Gemma 4 31B |
| --- | --- | --- | --- |
| Provider | Z.AI | Alibaba | Google |
| Parameters | 744B MoE (40B active) | Not disclosed | 31B dense |
| Context window | 203K | 1M | 256K |
| Price (per 1M tokens) | $1.40 / $4.40 | Free (preview) | Free (open weights) |
| License | MIT | Proprietary | Apache 2.0 |
| Can self-host? | Yes | No | Yes |
| Fits on 1 GPU? | No | No | Yes (1× RTX 4090) |
| BenchLM rank | #10 (score: 84) | #23 (score: 77) | #24 (score: 74 est.) |
| Arena Elo | 1467 | N/A | 1451 |

Reasoning & Knowledge: The PhD Test

If you’re building a research assistant, a scientific analysis tool, or anything that needs to reason like a graduate student, these are the benchmarks that matter.

| Benchmark | GLM-5.1 | Qwen3.6 Plus | Gemma 4 31B | Winner |
| --- | --- | --- | --- | --- |
| GPQA-Diamond | 86.2 | 90.4 | 84.3 | Qwen3.6 Plus |
| HLE (no tools) | 31.0 | 28.8 | 19.5 | GLM-5.1 |
| HLE (w/ tools) | 52.3 | 50.6 | 26.5 | GLM-5.1 |
| MMLU-Pro | ~82 | 88.5 | 85.2 | Qwen3.6 Plus |

What jumps out: Qwen3.6 Plus dominates GPQA-Diamond at 90.4% — a PhD-level science reasoning benchmark. That’s 4 points above GLM-5.1 and 6 points above Gemma 4. For knowledge-heavy workflows, Qwen is the pick.

But GLM-5.1 fights back on Humanity’s Last Exam (HLE), one of the hardest benchmarks in existence. Its 31.0% without tools and 52.3% with tools both lead this group. HLE is designed to be brutal: it tests expert-level reasoning across dozens of domains. The fact that GLM-5.1 is more than 11 points ahead of Gemma 4 here (31.0% vs 19.5% without tools) tells you these models are not in the same tier for deep reasoning.

Gemma 4’s numbers are respectable for a 31B model — 84.3% on GPQA-Diamond is remarkable given its size — but it’s clearly a step behind the two larger models on pure knowledge tasks.

Math: Where Giants Trade Blows

Competition math is where reasoning models separate themselves. AIME (American Invitational Mathematics Examination) problems require multi-step proofs and creative problem-solving that go far beyond arithmetic.

| Benchmark | GLM-5.1 | Qwen3.6 Plus | Gemma 4 31B | Winner |
| --- | --- | --- | --- | --- |
| AIME 2026 | 95.3 | 95.1 | 89.2 | GLM-5.1 |
| HMMT Nov 2025 | 94.0 | 94.6 | — | Qwen3.6 Plus |
| HMMT Feb 2026 | 82.6 | 87.8 | — | Qwen3.6 Plus |
| IMOAnswerBench | 83.8 | 83.8 | — | Tie |

GLM-5.1 and Qwen3.6 Plus are nearly inseparable on AIME 2026 — 95.3 vs 95.1 is within measurement noise. But Qwen pulls ahead on the HMMT (Harvard-MIT Mathematics Tournament) benchmarks, particularly February 2026 where it leads by over 5 points. If you need a math workhorse, Qwen3.6 Plus has the edge on harder competition problems.

Gemma 4 at 89.2% on AIME 2026 is the real story, though. That’s a 31-billion-parameter model scoring within about 6 points of models with roughly 24× more parameters (GLM-5.1 has 744B; Qwen’s count is undisclosed). For anyone running local inference, this is the best math model you can fit on consumer hardware, by a wide margin.

Coding & Software Engineering: The Real Differentiator

This is where the three models tell completely different stories.

| Benchmark | GLM-5.1 | Qwen3.6 Plus | Gemma 4 31B | Winner |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro | 58.4 | 56.6 | — | GLM-5.1 |
| SWE-Bench Verified | 77.8 | 78.8 | — | Qwen3.6 Plus |
| Terminal-Bench 2.0 | 63.5 | 61.6 | — | GLM-5.1 |
| NL2Repo | 42.7 | 37.9 | — | GLM-5.1 |
| LiveCodeBench v6 | — | 87.1 | 80.0 | Qwen3.6 Plus |
| Codeforces Elo | — | — | 2150 | — |

GLM-5.1 holds the global #1 spot on SWE-Bench Pro at 58.4, beating GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). This isn’t a niche benchmark — SWE-Bench Pro measures whether a model can autonomously fix real GitHub issues in production repositories. GLM-5.1 was specifically architected for this: it can run for hours, execute tests, read error messages, and iterate on its own fixes.
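That run-tests, read-errors, iterate pattern is, at its core, a bounded loop. Here is a minimal sketch of the shape of such an agent, with `run_tests` and `propose_patch` as invented stand-ins (not Z.AI's actual API) for a real test harness and a model call:

```python
# Minimal autonomous-repair loop: the shape of an agent that runs its own
# tests and iterates on failures until they pass or the budget runs out.

def run_tests(code: str) -> list[str]:
    """Stand-in test harness: returns a list of failure messages."""
    return [] if "fixed" in code else ["test_example failed"]

def propose_patch(code: str, failures: list[str]) -> str:
    """Stand-in model call: would send code + failures to the model."""
    return code + "  # fixed"

def repair(code: str, max_iters: int = 8) -> tuple[str, bool]:
    for _ in range(max_iters):
        failures = run_tests(code)
        if not failures:
            return code, True      # all tests pass
        code = propose_patch(code, failures)
    return code, False             # iteration budget exhausted

patched, ok = repair("def add(a, b): return a - b")
print(ok)  # True with these stubs
```

A real agent replaces the stubs with a sandboxed test runner and a model API call, and the `max_iters` budget becomes a wall-clock budget; the control flow is the same.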

Qwen3.6 Plus edges ahead on SWE-Bench Verified and leads LiveCodeBench v6 at 87.1%. It also made headlines by beating Claude Opus 4.5 on Terminal-Bench 2.0 (61.6 vs 59.3), the first model to do so on a terminal automation benchmark, though GLM-5.1’s 63.5, released days later, now tops both.

Gemma 4’s 80.0% on LiveCodeBench v6 and 2150 Codeforces Elo are extraordinary for a model you can run locally. But it doesn’t have published SWE-Bench numbers yet, which makes it harder to compare on real-world software engineering.

The bottom line for coding: If you need an autonomous coding agent that can work for hours on complex repos, GLM-5.1 is unmatched. If you need a strong API-powered coding assistant, Qwen3.6 Plus gives you more for free. If you need to run everything locally, Gemma 4 is your only option in this group.

Agentic Capabilities: The New Frontier

Agentic benchmarks — measuring a model’s ability to use tools, browse the web, navigate terminals, and complete multi-step tasks — are where 2026’s model race is actually being decided.

| Benchmark | GLM-5.1 | Qwen3.6 Plus | Gemma 4 31B | Winner |
| --- | --- | --- | --- | --- |
| BrowseComp (w/ context) | 68.0 | — | — | GLM-5.1 |
| MCP-Atlas Public | 71.8 | 48.2 | — | GLM-5.1 |
| Claw-Eval (pass³) | 58.7 | — | — | — |
| CyberGym | 68.7 | — | — | GLM-5.1 |
| τ2-bench (agentic) | — | — | 86.4 | — |

GLM-5.1 is clearly the agentic leader. Its MCP-Atlas score of 71.8% dwarfs Qwen3.6 Plus’s 48.2%, a gap of more than 23 points. On BenchLM’s agentic category, GLM-5.1 scores 82.8/100 versus Qwen’s 71.6/100. This aligns with Z.AI’s design philosophy: GLM-5.1 was built to work autonomously for hours, executing code, reading outputs, and iterating without human intervention.

Gemma 4 has a strong τ2-bench score (86.4%) but lacks coverage on most agentic benchmarks. Given its size, it’s impressive it can do tool use at all — but it wasn’t designed for multi-hour autonomous workflows.

Multimodal: Qwen’s Territory

This is the most lopsided category in the comparison, and it’s not close.

| Benchmark | GLM-5.1 | Qwen3.6 Plus | Gemma 4 31B | Winner |
| --- | --- | --- | --- | --- |
| MMMU Pro | — | 86.0 | 76.9 | Qwen3.6 Plus |
| RealWorldQA | — | 85.4 | — | Qwen3.6 Plus |
| OmniDocBench v1.5 | — | 91.2 | — | Qwen3.6 Plus |
| Video-MME (w/ sub) | — | 87.8 | — | Qwen3.6 Plus |
| MATH-Vision | — | 85.6 | — | Qwen3.6 Plus |

GLM-5.1 has no published multimodal benchmarks. Gemma 4 has limited coverage (76.9% on MMMU Pro). Qwen3.6 Plus sweeps every multimodal benchmark in this comparison — and it’s competitive with Claude Opus 4.5 and Gemini 3 Pro on these same tests.

If your workflow involves images, documents, video, or visual reasoning, Qwen3.6 Plus is the only model in this group you should consider. It’s not just the best here by default — its 91.2% on OmniDocBench (document understanding) leads all models globally, not just these three.

The Price-Performance Question

Here’s where the comparison gets interesting for anyone with a budget.

  • Qwen3.6 Plus is free right now. Zero dollars. A million-token context window. Strong across almost everything. The catch: it’s proprietary, preview pricing won’t last forever, and you can’t self-host.
  • Gemma 4 31B is also free — permanently. Apache 2.0 license. Download it, fine-tune it, deploy it on your own infrastructure. The trade-off: it’s 10+ points behind the larger models on several benchmarks and lacks coverage on agentic and SWE-Bench tests.
  • GLM-5.1 costs $1.40/$4.40 per million tokens. It’s the most expensive here. But it’s also the highest-ranked model in this group (#10 globally), open-weight under MIT, and dominates the benchmarks that matter most for autonomous coding agents.

If you’re a startup building an agentic coding tool, GLM-5.1’s $4.40/M output tokens will add up quickly — but its SWE-Bench Pro leadership might make it worth every penny. If you’re a researcher running experiments on your own hardware, Gemma 4 gives you 89% AIME performance for the cost of electricity. If you’re prototyping and don’t want to spend anything, Qwen3.6 Plus is a no-brainer.
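To make the budget math concrete, here is the metered cost at GLM-5.1's listed rates. The token volumes below are invented for illustration, not measurements of any real workload:

```python
# GLM-5.1 list prices from the comparison table, in dollars per million tokens.
INPUT_PER_M, OUTPUT_PER_M = 1.40, 4.40

def glm_cost(input_tokens: int, output_tokens: int) -> float:
    """Metered API cost in dollars for a given token volume."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Hypothetical heavy agentic day: 50M tokens read, 5M tokens generated.
print(f"${glm_cost(50_000_000, 5_000_000):.2f}")  # $92.00
```

Agentic workloads are input-heavy (the model re-reads files, test output, and its own history every step), so the cheaper input rate dominates; in this invented example, roughly three quarters of the bill is input tokens.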

So Which One Should You Use?

I hate “it depends” conclusions as much as anyone. So here’s a decision framework that actually helps:

Choose GLM-5.1 if:

  • You need autonomous coding agents that work for hours without supervision
  • You care about SWE-Bench Pro performance (it’s #1 globally)
  • You need strong agentic capabilities (MCP, browsing, cybersecurity)
  • You want open weights with a permissive license
  • You’re building production coding infrastructure

Choose Qwen3.6 Plus if:

  • You need multimodal understanding (documents, images, video)
  • You want the largest context window available (1M tokens)
  • You’re price-sensitive (free during preview)
  • You need strong math performance, especially on HMMT-style problems
  • You want the strongest GPQA score in this group (90.4%)

Choose Gemma 4 31B if:

  • You need to run models locally on consumer hardware
  • You want full control over your inference pipeline
  • You’re doing math-heavy work where 89.2% AIME is sufficient
  • You need a permissive license (Apache 2.0) for commercial use
  • You’re building edge or on-device AI applications
  • Latency and privacy matter more than peak benchmark scores
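The framework above collapses into a small lookup. A toy sketch, where the requirement flags and their priority order are my own simplification of the three lists:

```python
def pick_model(needs_local: bool = False, needs_multimodal: bool = False,
               needs_long_autonomy: bool = False, budget_zero: bool = False) -> str:
    """Toy encoding of the decision framework in this article."""
    if needs_local:
        return "Gemma 4 31B"   # only model here that fits on one GPU
    if needs_multimodal:
        return "Qwen3.6 Plus"  # only model here with published vision/video scores
    if needs_long_autonomy:
        return "GLM-5.1"       # built for multi-hour agentic runs
    if budget_zero:
        return "Qwen3.6 Plus"  # free during its preview period
    return "GLM-5.1"           # highest BenchLM rank in this group

print(pick_model(needs_local=True))  # Gemma 4 31B
```

The priority order encodes the article's hard constraints first: local deployment and multimodality each eliminate two of the three candidates outright, while autonomy and price are trade-offs.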

The Bigger Picture

What’s genuinely interesting about this comparison isn’t any single benchmark. It’s that the three models represent three completely different bets on the future of AI:

GLM-5.1 bets that the future is autonomous agents — models that don’t just answer questions but execute entire workflows. Its 8-hour sustained execution capability is something no other model in this class offers.

Qwen3.6 Plus bets that the future is breadth — one model that can do everything reasonably well, for free, with enough context to process entire codebases or document libraries in a single prompt.

Gemma 4 31B bets that the future is local — that the most important model is the one you can run yourself, on your own hardware, without sending data to anyone.

All three bets are probably right for different people. The era of one model to rule them all is over. The era of picking the right tool for the right job has begun.

All benchmark data sourced from Z.AI’s official blog (z.ai/blog/glm-5.1), BenchLM.ai, Google DeepMind’s Gemma 4 model card, and third-party evaluations from Artificial Analysis and LayerLens. Data current as of April 12, 2026. Some benchmarks lack coverage for certain models — dashes indicate no published score was available.