Open-Weight LLM Rankings, April 2026: MMLU Is Saturated, Here’s What to Use Instead

The open-weight model ecosystem in April 2026 looks nothing like it did eighteen months ago. In late 2024, the question was whether open models could approach proprietary frontier quality. That question is settled. On the benchmarks that distinguish capable from capable-enough, six labs now field open-weight models that match or exceed what GPT-4 and Claude 3.5 produced twelve months ago. The question now is which open model to choose, for which task, at what hardware cost, under what license. The answers are not obvious, and most coverage of the open-weight LLM field in April 2026 misreports the leaderboard by treating MMLU as a meaningful differentiator. MMLU is saturated. The models that matter differ on the benchmarks that are harder to game.

This analysis covers the current state of the open-weight LLM race as of late April 2026, with benchmark data sourced from multiple independent leaderboards, licensing and hardware requirements, and the specific task profiles where each model is the right choice.

The Benchmark That Matters Now: MMLU Is No Longer the Signal

MMLU, the Massive Multitask Language Understanding benchmark, was the primary differentiator between models from 2021 through early 2025. With frontier models now clustered at 88-94%, it no longer distinguishes anything meaningful. Llama 4 Maverick leads all open models on MMLU at 85.5%, but that number tells you almost nothing useful about how it performs on the tasks that determine whether a model is appropriate for production deployment.

The benchmarks that provide real signal in April 2026 are SWE-bench Verified for software engineering tasks (measures real GitHub issue resolution, not synthetic code completion), GPQA Diamond for scientific reasoning at doctoral level, and AIME 2025 for mathematical reasoning. NVIDIA’s RULER benchmark provides a separate measurement: how much of an advertised context window is actually reliable. The answer across all current models is roughly 50-65%. A model claiming a 1-million-token context window reliably uses 500,000 to 650,000 of those tokens before retrieval quality degrades. For production agent deployments that depend on long-context memory, this effective context boundary is the number that matters, not the headline figure.

The Overall Leaders: Chinese Labs Dominate the Top Five

The BenchLM.ai leaderboard for April 2026 shows DeepSeek V4 Pro at 87 overall, Kimi K2.6 at 86, GLM-5 Reasoning and GLM-5.1 at 83, and Qwen 3.5 397B at 79. The top five open-weight models are all from Chinese labs: DeepSeek (Beijing), Moonshot AI / Kimi (Beijing), Zhipu AI / GLM (Beijing), and Alibaba / Qwen (Hangzhou). Meta’s Llama 4 family, the default reference model for US open-weight development, sits at 43 on the same scale. This is not a close race at the top.

Understanding why requires looking at the architecture choices the Chinese labs made in the 2025-2026 model generation. Mixture-of-Experts has become the dominant architecture. Qwen 3.5, with 397 billion total parameters, activates only 17 billion per forward pass. GLM-5 uses a similar MoE structure. DeepSeek V4 Pro uses MoE with an innovative routing scheme. The practical consequence is that the 397B-parameter Qwen 3.5 has roughly the per-token compute and inference latency of a 17B dense model while drawing on 397B parameters of accumulated knowledge; the full parameter set still has to be resident in memory, so the savings are in compute and latency rather than weight storage. Meta’s Llama 4 also uses MoE, but with different parameter counts and routing strategies. The Alibaba Token Hub restructuring that produced the Qwen 3.6-Plus family shows how concentrated release velocity from a coordinated AI unit affects benchmark position.
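To make the routing mechanics concrete, here is a minimal sketch of top-k expert routing in PyTorch. The dimensions, expert counts, and routing details are toy values, not the actual Qwen, DeepSeek, or GLM configurations; the point is only that each token's forward pass touches a small, routed subset of the total parameters.

```python
# Toy top-k mixture-of-experts layer: many experts exist, but each token
# only runs through a few of them, so per-token compute scales with the
# active parameters rather than the total parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # normalize over chosen experts
        out = torch.zeros_like(x)
        # Only the routed experts run for each token; every other expert's
        # parameters sit idle this step, which is where the per-token compute
        # savings come from. All experts still occupy memory.
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Toy usage: 16 tokens each route through 2 of 64 experts.
layer = TopKMoE()
print(layer(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```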

By Task Category: Which Model for Which Job

The benchmark spread across tasks is substantial enough that general rankings mislead. Asking which open model is best without specifying the task is the wrong question.

For software engineering and code generation, the coding agent architecture matters as much as the model, but for raw model capability, MiniMax M2.5 leads SWE-bench Verified at 80.2%, effectively matching Claude Opus 4.6 at 80.8%. GLM-5.1 scores 77.8% on SWE-bench Verified. Kimi K2.5 leads HumanEval at 99.0% and LiveCodeBench at 84.9%. For practical GitHub issue resolution rather than synthetic benchmarks, MiniMax M2.5 and GLM-5.1 are the current open-weight leaders.

For scientific reasoning at doctoral level, Qwen 3.5 leads GPQA Diamond at 88.4%, followed by Kimi K2.5 at 87.6% and GLM-5 at 86.0%. GPQA Diamond tests physics, chemistry, and biology questions at the level that doctoral students find difficult. Performance here correlates with reliable answers to complex analytical questions in regulated domains like healthcare and legal.

For mathematical reasoning, DeepSeek V3.2-Speciale achieved gold-medal performance at IMO, IOI, and ICPC 2026. Kimi K2.5 leads MATH-500 at 98.0%. The DeepSeek and Kimi families lead on multi-step mathematical proof tasks where showing work and maintaining consistency across long reasoning chains matters most.

For general-purpose chat with multilingual requirements, Qwen 3.5 supports 200 languages and dialects and scores 86.7% on MMLU alongside its leading GPQA Diamond performance. Llama 4 Maverick posts the highest raw MMLU at 85.5% among open models, and its 10-million-token context window through the Scout variant is unmatched for long-document analysis.

For efficiency on consumer and edge hardware, Gemma 4 26B MoE runs at 85 tokens per second on a consumer GPU while fitting in 14 GB of memory. The Qwen 3.5-35B-A3B variant activates only 3 billion parameters per forward pass, running at speeds comparable to a 3B dense model, though the full 35B parameters still have to be loaded. Mistral Small 4, at 6 billion active parameters, packages Devstral’s agentic coding capabilities into a model that runs on a single high-end consumer GPU under an Apache 2.0 license.

License Analysis: Where Open Gets Complicated

The license picture is more fragmented than most coverage acknowledges, and the differences have real production implications.

Apache 2.0 is the most permissive option: full commercial use, modification, fine-tuning, and redistribution without royalties, usage caps, or geographic restrictions. Current Apache 2.0 models include Qwen 3/3.5, Gemma 4, and Mistral Small 4. The switch to Apache 2.0 for Mistral’s models in 2026 is significant because Mistral’s prior custom license restricted certain commercial uses.

The MIT license provides similar freedoms to Apache 2.0, but without Apache’s explicit patent grant. DeepSeek releases under MIT. GLM-5.1 uses MIT. For most practical purposes, MIT and Apache 2.0 are equally permissive for commercial deployment, though legal teams in some industries prefer Apache 2.0 precisely for that explicit patent grant.

Meta’s Llama license restricts use above 700 million monthly active users. For most organizations this restriction is irrelevant in practice. For large-scale consumer products, it is not. The Llama license also prohibits training other models on Llama-generated outputs except under specific provisions. This distillation restriction matters for teams building model training pipelines.

Custom licenses from Chinese labs require careful reading. Geographic restrictions, commercial deployment limitations, and prohibitions on competitive use appear in some Chinese lab licenses with inconsistent specificity. GLM-5 under MIT is clean. Some earlier Zhipu AI and Moonshot models had more restrictive terms. Always verify the current license version before deploying, because these labs update license terms with model updates.

The Effective Context Window Problem

The RULER finding that models reliably use only 50-65% of their advertised context window is one of the most practically important results missing from most model comparison articles. The headline context window is a maximum capacity figure, not a reliable performance figure. Performance degrades significantly beyond the effective threshold.

Llama 4 Scout advertises 10 million tokens and reliably uses approximately 5-6.5 million. DeepSeek V4 claims 1 million tokens and reliably performs at 500,000-650,000. Qwen 3.5’s 256,000 effective context is what teams building RAG pipelines should plan around, not 500,000. For the context collapse failure mode that accounts for 31% of agent pilot failures, this effective context boundary is where the failure begins. Teams that design agent workflows assuming the full advertised context window consistently encounter degradation when workflows approach the effective boundary.
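A minimal planning sketch of that discount follows, assuming the roughly 50-65% RULER reliability factor described above. The advertised figures mirror the article’s numbers; the 0.55 default factor is an illustrative midpoint, not a measured value for any specific release.

```python
# Plan prompt assembly against the effective context window rather than the
# advertised one. All constants here are illustrative placeholders.

ADVERTISED_CONTEXT = {          # tokens, as published
    "llama-4-scout": 10_000_000,
    "deepseek-v4": 1_000_000,
}

def effective_context(model: str, ruler_factor: float = 0.55) -> int:
    """Token budget to plan a RAG or agent workflow around."""
    return int(ADVERTISED_CONTEXT[model] * ruler_factor)

def prompt_fits(model: str, system_tokens: int, retrieved_tokens: int,
                history_tokens: int, reply_budget: int = 4_096) -> bool:
    """Check a prompt assembly against the effective, not advertised, window."""
    needed = system_tokens + retrieved_tokens + history_tokens + reply_budget
    return needed <= effective_context(model)

# DeepSeek V4's advertised 1M window plans out to ~550k reliable tokens,
# inside the 500,000-650,000 range cited above.
print(effective_context("deepseek-v4"))                      # 550000
print(prompt_fits("deepseek-v4", 2_000, 600_000, 10_000))    # False
```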

The Cost Calculation: API vs. Self-Hosted

LLM API prices dropped approximately 80% from 2025 to 2026 across major providers. At current prices, self-hosting a 400B+ parameter model costs $2,000-5,000 per month in cloud GPU compute, which only produces cost savings versus API pricing above roughly 50 million tokens per month. Below that volume, API pricing is cheaper than the fixed infrastructure cost even for open-weight models.
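The break-even arithmetic is simple enough to sketch. The dollar figures below are illustrative placeholders chosen to reproduce the rough 50-million-token threshold above; substitute your provider’s blended per-million-token price and your actual monthly GPU bill.

```python
# Back-of-the-envelope break-even for the self-host vs. API decision.

def breakeven_tokens_per_month(gpu_cost_per_month: float,
                               api_price_per_million: float) -> float:
    """Monthly volume above which self-hosting beats the API on cost alone."""
    return gpu_cost_per_month / api_price_per_million * 1_000_000

def cheaper_option(tokens_per_month: float, gpu_cost_per_month: float,
                   api_price_per_million: float) -> str:
    api_cost = tokens_per_month / 1_000_000 * api_price_per_million
    return "self-host" if api_cost > gpu_cost_per_month else "api"

# A $3,500/month GPU footprint against a hypothetical $70 blended price per
# million tokens puts break-even at exactly 50M tokens/month.
print(breakeven_tokens_per_month(3_500, 70))        # 50000000.0
print(cheaper_option(20_000_000, 3_500, 70))        # api
print(cheaper_option(120_000_000, 3_500, 70))       # self-host
```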

The cost calculation changes for organizations with data sovereignty requirements. Financial services, healthcare, defense, and regulated industries that cannot send data to external APIs have no volume threshold calculation to make. They self-host or they do not deploy. For these organizations, the Apache 2.0 and MIT licensed models from the current top tier represent the most capable options without regulatory risk. The KYA governance framework from MetaComp addresses the deployment compliance layer above the model selection layer, but the model selection itself begins with license verification.

The Infrastructure to Actually Run These Models

Three tools dominate the practical self-hosting stack in April 2026. Ollama handles local model running with one command per model, automatic GPU memory management, and an OpenAI-compatible REST API. It works on macOS with Apple Silicon and Linux with NVIDIA or AMD GPUs. For development and prototyping, one command is the right abstraction. For production, vLLM provides continuous batching, PagedAttention, and throughput optimization for multi-user serving. LM Studio provides a GUI for Windows, macOS, and Linux that non-developers can use for local model access without command-line knowledge.
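As a concrete example of the development path, the snippet below talks to a locally served model through Ollama’s OpenAI-compatible endpoint (http://localhost:11434/v1 by default). The model tag is a placeholder; pull a real tag first with `ollama pull <tag>` and use whatever `ollama list` reports.

```python
# Minimal sketch: call a local Ollama server through the standard OpenAI client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's local server
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",               # placeholder local model tag
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the trade-offs of MoE models in two sentences."},
    ],
)
print(response.choices[0].message.content)
```

Because both Ollama and vLLM expose the OpenAI wire format, the same client code can point at a vLLM deployment in production by changing only the base URL.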

Q4_K_M quantization reduces model memory requirements by 50-60% with measured quality loss of 1-3% on most benchmarks. The Qwen 3.5-35B-A3B model, quantized to Q4_K_M, runs on a single RTX 4090 while maintaining competitive benchmark performance. At 4.2 GB for Gemma 3 4B, the smallest capable models run on hardware that organizations already own without additional GPU purchases.
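A rough way to check whether a quantized checkpoint fits a given GPU is to multiply parameter count by average bits per weight. The sketch below assumes approximate bits-per-weight figures (Q4_K_M is a mixed-precision scheme, so ~4.8 bits is an approximation) and ignores KV cache and activation memory, so treat the output as a lower bound.

```python
# Rough weight-storage estimator per quantization format.

BITS_PER_WEIGHT = {"fp16": 16.0, "q8_0": 8.5, "q4_k_m": 4.8}  # approximate averages

def weights_gb(n_params_billion: float, fmt: str) -> float:
    """Approximate weight storage in GB for a given parameter count and format."""
    return n_params_billion * 1e9 * BITS_PER_WEIGHT[fmt] / 8 / 1e9

# A 35B-parameter checkpoint: ~70 GB at fp16 versus ~21 GB at Q4_K_M, which is
# why it can squeeze onto a single 24 GB consumer GPU.
for fmt in ("fp16", "q8_0", "q4_k_m"):
    print(f"{fmt}: {weights_gb(35, fmt):.1f} GB")
```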

What the Benchmark Convergence Actually Means

The convergence of open-weight and proprietary model performance on coding benchmarks specifically is worth examining. MiniMax M2.5 at 80.2% on SWE-bench Verified, matching Claude Opus 4.6 at 80.8%, represents a genuine closing of the gap that most observers expected to take until at least 2027 based on the 2024 trajectory. The gap closed faster than predicted for the same reason that DeepSeek R1 appeared earlier than expected: architectural innovation (MoE routing, better training data curation, improved RL fine-tuning for reasoning) is producing capability gains faster than raw compute scaling.

This benchmark convergence has a direct implication for teams deciding whether to build on proprietary APIs or open-weight models. At equal capability on coding tasks, the decision reduces to: API convenience and managed infrastructure versus data sovereignty, cost at volume, and license flexibility. For most new production agent deployments, the capability gap that once made the API case compelling has effectively closed on coding and scientific reasoning.

MMLU at 85.5% for Llama 4 Maverick versus 88% for GPT-5.4 used to be a meaningful capability difference. At today’s absolute performance levels, it is not the right metric to drive architecture decisions. The metrics that matter for production deployments are SWE-bench Verified for engineering tasks, GPQA Diamond for analytical reasoning, effective context under RULER for long-document workflows, and license terms for compliance. On those four dimensions, the open-weight ecosystem in April 2026 is a serious production option for most use cases where it was not twelve months ago.
