NVIDIA released Nemotron 3 Super at GTC 2026 and immediately set a new record. The model scored 60.47% on SWE-Bench Verified, the highest score ever recorded by an open-weight model on the benchmark that measures autonomous resolution of real GitHub issues. It has 120 billion total parameters and activates only 12 billion at inference. It delivers 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B in high-volume settings. The weights are public under NVIDIA’s Open Model License.
The benchmark result is striking. The company behind the result is more interesting than the number.
NVIDIA sells GPUs. It does not need to compete in the model race to win the AI market. The fact that it released a state-of-the-art open-weight model anyway tells you something precise about its strategy.
The Architecture
Nemotron 3 Super uses three architectural innovations that differentiate it from standard transformer-based models.
The backbone interleaves Mamba-2 layers with transformer attention layers in what NVIDIA calls a Hybrid Mamba-Transformer architecture. Mamba-2 layers are state-space models that process sequences with linear complexity relative to sequence length rather than the quadratic scaling of standard attention. This is the mechanism behind the 1-million-token context window: standard attention would be prohibitively expensive at that length, but Mamba layers handle long sequences at constant memory per token. Transformer layers handle the reasoning operations that benefit from full attention over the entire context.
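The cost argument can be made concrete with a toy calculation. The layer counts and interleaving ratio below are illustrative assumptions, not NVIDIA's published configuration; the point is only how replacing most quadratic attention layers with linear state-space layers changes the total cost at long context lengths.

```python
# Illustrative sketch: cost of a hybrid Mamba/attention stack vs. an
# all-attention stack. Layer counts and the interleaving ratio are
# assumptions for demonstration, not the model's real configuration.

def attention_cost(seq_len: int) -> int:
    # Full self-attention scales quadratically with sequence length.
    return seq_len * seq_len

def mamba_cost(seq_len: int) -> int:
    # A state-space layer updates a fixed-size state per token, so its
    # cost is linear in sequence length.
    return seq_len

def hybrid_cost(seq_len: int, n_layers: int = 48, attn_every: int = 6) -> int:
    # Hypothetical pattern: one attention layer per `attn_every` layers,
    # the rest Mamba-2.
    n_attn = n_layers // attn_every
    n_mamba = n_layers - n_attn
    return n_attn * attention_cost(seq_len) + n_mamba * mamba_cost(seq_len)

for n in (4_096, 131_072, 1_000_000):
    dense = 48 * attention_cost(n)
    print(f"{n:>9} tokens: hybrid is {dense / hybrid_cost(n):,.1f}x cheaper")
```

At short contexts the two are comparable; at a million tokens the savings converge toward the ratio of attention layers removed, which is why the hybrid design makes the 1-million-token window tractable at all.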
On top of this backbone sits a Latent Mixture-of-Experts design. Standard MoE models activate a subset of experts per token. Nemotron 3 Super’s Latent MoE activates four expert specialists for the inference cost of one by routing through latent representations rather than full expert computation. This is how the model achieves 12 billion active parameters while maintaining accuracy competitive with much larger dense models.
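One plausible reading of that claim can be shown with a toy example. If the expert operation is linear, blending the selected experts' weights by their gate values and running one forward pass gives exactly the same output as running all four experts and summing. Everything below is an illustrative sketch under that assumption; real expert FFNs are nonlinear, and NVIDIA has not published the exact latent routing mechanism.

```python
import math
import random

# Toy latent-MoE sketch: mix the top-k experts' weight vectors first,
# then do ONE expert-sized computation instead of k of them. Dimensions,
# router, and linear experts are all illustrative assumptions.
random.seed(0)
DIM, N_EXPERTS, TOP_K = 8, 16, 4
experts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(token):
    # Toy router: score each expert against the token, keep the top-k,
    # normalize the kept scores with a softmax.
    scores = [dot(e, token) for e in experts]
    top = sorted(range(N_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def latent_moe(token):
    # Blend the selected experts' weights by their gate values, then run
    # a single expert-sized pass.
    gates = route(token)
    mixed = [sum(g * experts[i][d] for i, g in gates) for d in range(DIM)]
    return dot(mixed, token)
```

For linear experts the blended pass is exactly equivalent to gate-weighting four separate expert outputs, at a quarter of the compute; the engineering question a real latent MoE answers is how to retain most of that saving once the experts are nonlinear.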
The third element is Multi-Token Prediction, which predicts multiple output tokens simultaneously rather than one at a time. MTP delivers roughly 3x faster inference on Blackwell GPUs using NVFP4 precision according to NVIDIA’s benchmarks.
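The arithmetic behind that speedup is simple: with k prediction heads, emitting n tokens takes roughly n/k forward passes instead of n. The head count and acceptance model below are illustrative; real MTP and speculative schemes verify drafted tokens against the base model and fall back when drafts miss.

```python
import math

# Illustrative decoding-cost model for multi-token prediction. Head
# count and acceptance rate are assumptions, not published figures.

def forward_passes(n_tokens: int, heads: int, accept_rate: float = 1.0) -> int:
    # Each pass emits 1 guaranteed token plus any accepted drafts.
    tokens_per_pass = 1 + (heads - 1) * accept_rate
    return math.ceil(n_tokens / tokens_per_pass)

print(forward_passes(512, 1))                   # baseline: one token per pass
print(forward_passes(512, 3))                   # perfect acceptance: ~3x fewer
print(forward_passes(512, 3, accept_rate=0.7))  # partial acceptance
```

This is why the realized speedup depends on workload: predictable output like boilerplate code accepts drafts at high rates, while high-entropy text claws back fewer passes.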
Training used synthetic data generated from frontier reasoning models. NVIDIA published the complete methodology, over 10 trillion tokens of pre- and post-training datasets, and 15 reinforcement learning training environments alongside the model release. The post-training data has a cutoff of February 2026. The release is reproducible in a way that many open-weight releases are not.
What SWE-Bench Verified Actually Measures
SWE-Bench Verified tests whether a model can autonomously resolve real GitHub issues pulled from production repositories. Each issue comes with the repository code, the issue description, and a test suite. The model must produce a code patch that passes the tests without being told what to change or how. The evaluation uses OpenHands scaffolding and NVIDIA’s NeMo Evaluator SDK.
A 60.47% score means Nemotron 3 Super can resolve more than 60 out of every 100 real GitHub issues without human intervention. The best closed models, including Claude Opus 4.6 and GPT-5.x variants, score in the 60 to 70% range depending on evaluation configuration. Nemotron 3 Super, running on hardware you own, sits in that range.
That threshold matters for enterprise deployment. A model resolving 60% of real GitHub issues autonomously can replace meaningful fractions of routine engineering work inside a secure network. No data leaves the organization. No API call goes to OpenAI’s or Anthropic’s servers. Aerospace contractors, financial institutions, and healthcare IT teams operating under data residency requirements can deploy this model where they cannot deploy cloud APIs.
Early adopters include Perplexity, CodeRabbit, Factory, Greptile, Palantir, Cadence, Dassault Systemes, and Siemens. The roster skews heavily toward enterprise infrastructure and engineering tools rather than consumer applications. The model also holds the top position on DeepResearch Bench and DeepResearch Bench II, measures of multi-step research capability across large document sets.
NVIDIA’s Strategic Logic
NVIDIA makes money when people buy and deploy GPUs. A 120-billion-parameter model that achieves frontier-tier coding performance creates a compelling reason to buy more H100s, H200s, or Blackwell B200s. NVIDIA does not need the model to generate direct revenue. The model generates GPU demand.
This explains why NVIDIA published extensive training methodology, released 10 trillion tokens of training data, and distributed the model under a relatively permissive license. The more widely Nemotron 3 Super is adopted, the more GPU clusters organizations need to run it. Fine-tuning requires compute. Inference at scale requires compute. Every downstream use case is a potential sale for the hardware division.
The strategy puts OpenAI and Anthropic in an unusual position. Their closed models compete not just with each other but with open-weight models distributed by one of their primary infrastructure suppliers. Criticizing Nemotron 3 Super means criticizing NVIDIA, which simultaneously supplies the compute that OpenAI and Anthropic depend on to run their own products. NVIDIA benefits regardless of which lab wins the model competition as long as the competition drives GPU demand higher.
The comparison to Google giving away Android to sell search advertising is imperfect but instructive. NVIDIA is not giving away the model in a fully free sense: the hardware required to run it costs money. But the model itself is free, and the value flows through hardware purchases rather than model subscriptions. For organizations with existing NVIDIA infrastructure, adopting Nemotron 3 Super has near-zero incremental software cost.
Limitations Worth Noting
The SWE-Bench Verified score was measured using OpenHands scaffolding and NVIDIA's NeMo Evaluator SDK. NVIDIA has released the evaluation methodology for reproducibility, but independent third-party verification of this specific score had not been published at the time of writing. The throughput benchmarks of 2.2x over GPT-OSS-120B and 7.5x over Qwen3.5-122B come from NVIDIA's own testing under conditions that have not yet been independently replicated.
The license is NVIDIA Open Model License, not Apache 2.0 or MIT. It permits commercial use and fine-tuning but includes restrictions that differentiate it from fully open-source licensing. Specifically, it restricts use for training models that compete with NVIDIA’s commercial products. Enterprise legal teams should review the full license terms before building proprietary products directly on Nemotron weights.
At 120 billion total parameters, running Nemotron 3 Super on a single consumer GPU requires aggressive quantization that degrades output quality. Practical production deployments target multi-GPU server clusters. The on-premises advantage is real, but it requires on-premises server hardware that not every organization has or can quickly acquire. NVIDIA’s NIM containers are the intended production deployment path.
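A back-of-envelope footprint makes the hardware requirement concrete. MoE routing saves compute, not weight memory: all 120 billion parameters must be resident even though only 12 billion are active per token.

```python
# Rough weight-memory footprint at common serving precisions. This
# counts weights only -- KV cache and activations come on top.

TOTAL_PARAMS = 120e9  # all experts resident, regardless of routing

def weight_gb(bits_per_param: int) -> float:
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

for fmt, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{fmt:>5}: {weight_gb(bits):5.0f} GB")
```

Even at 4 bits per weight, roughly 60 GB of weights alone exceeds the memory of any single consumer GPU, which is why practical deployments land on multi-GPU servers.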
The efficiency work published in early 2026, including the KV cache quantization research covered here (Google TurboQuant Compresses LLM Memory by 6x), reduces the memory cost of running large models. A 6x reduction in KV cache memory requirements translates directly into the ability to serve Nemotron 3 Super on fewer or cheaper GPUs, which is positive for the on-premises deployment use case.
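The leverage of cache compression is easy to see with rough numbers. The layer and head counts below are illustrative assumptions; in a hybrid design, only the attention layers keep a KV cache at all, since Mamba layers carry a fixed-size state instead.

```python
# Rough KV-cache sizing under assumed dimensions (attention layers,
# KV heads, head size). Only the attention layers of a hybrid stack
# contribute to this cache.

def kv_cache_gb(seq_len: int, attn_layers: int = 10, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    # Factor of 2 covers keys and values.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9

full = kv_cache_gb(1_000_000)  # BF16 cache at the 1M-token context
print(f"BF16: {full:.1f} GB; with ~6x compression: {full / 6:.1f} GB")
```

At million-token contexts the cache rivals the quantized weights in size, so a 6x reduction translates directly into more concurrent sequences per GPU, or fewer GPUs per deployment.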
What Happens Next
Three signals to watch over the next two quarters.
First, whether NVIDIA ships smaller Nemotron models targeting edge devices or workstation hardware. The 120-billion-parameter count limits the addressable market to server deployments. A smaller Nemotron targeting M-series MacBooks or single-GPU workstations would expand competitive pressure on OpenAI and Anthropic into segments where cloud APIs currently face no competition.
Second, whether OpenAI or Anthropic respond with their own open-weight releases. If they do, throughput becomes the tiebreaker. NVIDIA hardware has a structural advantage on throughput for models designed to run on NVIDIA silicon. If they do not respond, NVIDIA owns the on-premises enterprise coding segment without meaningful competition.
Third, whether the training methodology produces further improvements in subsequent model generations. NVIDIA released the 15 reinforcement learning training environments and evaluation recipes. Other labs can replicate and build on this approach. The next generation of open-weight coding models will likely treat Nemotron 3 Super’s architecture as a baseline rather than a ceiling.
The chip vendor that ships the best model complicates the AI industry’s revenue assumptions. It does not break them, but it sets a performance floor that closed commercial models must exceed to justify their pricing. That floor is currently 60.47% on SWE-Bench Verified. It will rise. The interesting question is whether it rises because of NVIDIA’s next model release or because OpenAI and Anthropic respond with open-weight releases of their own.