When Nvidia CEO Jensen Huang took the stage at GTC 2024, he made a bold promise: the new Blackwell architecture would deliver up to 30x performance gains compared to the previous industry standard, the Hopper H100. At the time, analysts and tech skeptics called it “marketing math.” They joked about reality distortion fields. But as the first independent benchmarks and deep-dive analyses roll in, it turns out the joke was on the skeptics.
Jensen didn’t overpromise; he underpromised. Recent data reveals that the Nvidia Blackwell Ultra, specifically within the GB300 NVL72 system, is delivering up to 50x better performance for specific AI inference workloads compared to Hopper. This isn’t just a generational leap; it is a complete paradigm shift in how we process artificial intelligence.
But how is this physically possible? How do you squeeze that much juice out of silicon? The answer lies not just in a faster chip, but in a fundamental rethinking of the data center, the mathematics of precision, and the software that orchestrates it all. Let’s pop the hood on this liquid-cooled beast and understand the mechanics of the 50x leap.
The Beast: What is the GB300 NVL72?
To understand the speed, you have to stop thinking about a GPU as a card you plug into a server. The Nvidia GB300 NVL72 is a rack-scale system that behaves as a single, massive graphics processing unit, integrating 72 Nvidia Blackwell Ultra GPUs and 36 Arm-based Grace CPUs into one liquid-cooled rack.
The “Ultra” in Blackwell Ultra isn’t just branding. It represents a configuration designed specifically for the age of AI reasoning and Agentic AI. While the standard Blackwell is a powerhouse, the Ultra version is tuned for memory capacity and bandwidth, which are the lifeblood of running massive models like DeepSeek-R1 or GPT-4.
The magic glue holding this rack together is the fifth-generation NVLink. In a traditional setup, GPUs communicate over Ethernet or InfiniBand, which are fast but introduce latency. In the NVL72, all 72 GPUs communicate via NVLink at a staggering 1.8 TB/s of bidirectional bandwidth per GPU. This creates a “scale-up” domain where 72 chips function as one giant brain with unified memory. This architecture is the foundation upon which the 50x performance claim is built.
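To get a feel for that scale, here is a back-of-the-envelope sketch in Python using Nvidia's headline numbers (roughly 1.8 TB/s of NVLink bandwidth and 288 GB of HBM3e per Blackwell Ultra GPU). Treat the results as approximations; exact figures depend on the configuration.

```python
# Rough scale of the NVL72 "one giant GPU" domain, using headline specs.
# These are approximate public figures, not guaranteed configuration values.

GPUS_PER_RACK = 72
NVLINK_BW_PER_GPU_TB_S = 1.8   # NVLink 5, bidirectional, per GPU
HBM_PER_GPU_GB = 288           # HBM3e per Blackwell Ultra GPU

aggregate_nvlink_tb_s = GPUS_PER_RACK * NVLINK_BW_PER_GPU_TB_S  # ~130 TB/s
pooled_hbm_tb = GPUS_PER_RACK * HBM_PER_GPU_GB / 1000           # ~20.7 TB

print(f"Aggregate NVLink bandwidth: ~{aggregate_nvlink_tb_s:.0f} TB/s")
print(f"Pooled HBM3e capacity:      ~{pooled_hbm_tb:.1f} TB")
```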
The “How”: Unpacking the 50x Speed Boost
A 50-fold increase in throughput per megawatt does not come from simply raising clock speeds. It requires a combination of hardware innovation, mathematical shortcuts, and clever software engineering. Here are the three pillars of this performance explosion.
1. The Magic of NVFP4 and Micro-Scaling
For years, the industry standard for AI training and inference was FP16 (16-bit floating point) or BF16. Recently, we moved to FP8. Blackwell Ultra takes this a step further with native support for NVFP4—4-bit floating point precision.
In simple terms, if you can represent the weights of a neural network with fewer bits (4 instead of 16), you can move data four times faster and store four times as much model in the same memory. However, simply chopping off bits usually destroys the accuracy of the model. The output becomes garbage.
Nvidia solved this with block scaling and microscaling formats. Instead of using one scaling factor for an entire tensor (the mathematical object holding the model’s data), NVFP4 breaks tensors down into tiny blocks of 16 values, and each block gets its own scaling factor. This allows the hardware to maintain high dynamic range and accuracy even at extremely low precision.
By using a two-level scaling strategy—combining block scaling with a high-precision tensor scaling factor—the Blackwell Ultra can run inference at 4-bit precision with accuracy comparable to 8-bit or 16-bit formats. This effectively doubles the compute throughput and memory bandwidth efficiency compared to FP8, without the model turning into a hallucination machine.
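To make block scaling concrete, here is a minimal NumPy sketch of the idea, not the actual NVFP4 implementation: it assumes 16-value blocks, the E2M1 (4-bit float) magnitude grid, and plain float scales in place of the FP8 block scales real hardware uses.

```python
import numpy as np

# Simplified sketch of two-level block scaling in the spirit of NVFP4.
# The real format packs 16-value blocks with FP8 (E4M3) block scales plus a
# higher-precision tensor scale and runs in hardware; this NumPy toy only
# shows why per-block scales preserve accuracy at 4 bits.

FP4_MAGNITUDES = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])  # E2M1 value grid
BLOCK = 16

def quantize_nvfp4_like(x: np.ndarray):
    x = x.reshape(-1, BLOCK)
    tensor_scale = np.abs(x).max() / 6.0                       # level 1: whole tensor
    xs = x / tensor_scale
    block_scale = np.abs(xs).max(axis=1, keepdims=True) / 6.0  # level 2: per block
    block_scale = np.where(block_scale == 0, 1.0, block_scale)
    xb = xs / block_scale
    # Snap each value to the nearest representable 4-bit magnitude, keeping sign.
    idx = np.abs(np.abs(xb)[..., None] - FP4_MAGNITUDES).argmin(axis=-1)
    q = np.sign(xb) * FP4_MAGNITUDES[idx]
    return q, block_scale, tensor_scale

def dequantize(q, block_scale, tensor_scale):
    return (q * block_scale * tensor_scale).ravel()

weights = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, bs, ts = quantize_nvfp4_like(weights)
recovered = dequantize(q, bs, ts)
print("mean absolute error:", np.abs(weights - recovered).mean())
```

Because each block’s largest magnitude maps to the top of the 4-bit range, one large outlier elsewhere in the tensor no longer crushes the precision of every other value.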
2. Disaggregated Prefill and Decode
The benchmarks behind the 50x figure rely heavily on a software technique called disaggregated prefill. To understand it, we need to look at how an LLM serves a request. Inference has two phases:
- Prefill: The model reads your prompt. This is compute-heavy and parallel.
- Decode: The model generates the answer token by token. This is memory-bandwidth bound and serial.
In traditional setups, the same GPU handles both. The heavy “prefill” requests clog up the system, slowing down the “decode” generation for everyone else. It’s like trying to drive a race car on a track that is also being used by heavy cargo trucks.
With the GB300 NVL72, inference providers use software stacks (like Nvidia Dynamo or open-source tools like SGLang) to separate these phases. One set of GPUs handles the heavy lifting of understanding the prompt (Prefill), then hands off the memory state (the KV cache) to other GPUs dedicated solely to generating answers (Decode). Because the NVL72 has such massive internal bandwidth, this handoff costs almost nothing. The result is a system that stays close to full utilization, without the “trucks” blocking the “race cars.”
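Here is a deliberately simplified Python sketch of the disaggregation pattern. The “model” is faked and the KV cache is just a list; the point is only to show the split between a prefill pool that builds the cache and a decode pool that extends it.

```python
from dataclasses import dataclass, field

# Toy sketch of disaggregated serving. One worker pool handles prefill
# (reading the prompt and building the KV cache); a separate pool handles
# decode (token-by-token generation). Real stacks such as Nvidia Dynamo or
# SGLang ship the KV cache across NVLink; here the "handoff" is just passing
# a Python object between the two pools.

@dataclass
class Request:
    prompt: str
    kv_cache: list = field(default_factory=list)  # stand-in for the real KV cache
    output: list = field(default_factory=list)

def prefill_worker(req: Request) -> Request:
    """Compute-heavy and parallel: process the whole prompt in one pass."""
    req.kv_cache = [f"kv({tok})" for tok in req.prompt.split()]
    return req

def decode_worker(req: Request, max_new_tokens: int = 4) -> Request:
    """Bandwidth-bound and serial: generate one token at a time, extending
    the KV cache that the prefill pool handed over."""
    for step in range(max_new_tokens):
        token = f"token_{step}"          # a real system would sample from the model
        req.kv_cache.append(f"kv({token})")
        req.output.append(token)
    return req

request = prefill_worker(Request(prompt="summarize this contract"))
request = decode_worker(request)
print(request.output)
```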
3. Wide Expert Parallelism (WideEP)
Modern frontier models, such as DeepSeek-R1 or Mixtral, are “Mixture of Experts” (MoE) models. They are huge, but for any given word, only a small fraction of the “brain” is active.
Running these on standard hardware is inefficient because you have to load the whole model into memory, even if you only use 5% of it for a specific token. The GB300 NVL72 utilizes Wide Expert Parallelism. Because all 72 GPUs share a high-speed lane, the system can spread the “experts” (different parts of the neural network) across all 72 chips.
When a token needs a specific expert, it is routed instantly to the specific GPU holding that expert. This allows the system to keep the entire massive model in fast memory (HBM3e) and utilize the aggregate bandwidth of the entire rack. The result is a massive spike in throughput that smaller, non-rack-scale systems simply cannot physically match.
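The routing idea can be sketched in a few lines of Python. Everything here is a placeholder (random gate scores, round-robin expert placement, an assumed expert count), but it shows the core mechanic: a token’s chosen expert determines which GPU rank does the work.

```python
import random

# Toy sketch of wide expert parallelism: experts are sharded across the 72
# GPUs in the NVLink domain, and each token is dispatched to the GPU rank
# that owns its chosen expert. The gating scores and placement policy are
# placeholders, not how a production MoE router is implemented.

NUM_EXPERTS = 256
NUM_GPUS = 72

# Static placement: spread experts round-robin across the rack.
expert_to_gpu = {expert: expert % NUM_GPUS for expert in range(NUM_EXPERTS)}

def route_token(gate_scores: list[float]) -> tuple[int, int]:
    """Pick the top-1 expert for a token and return (expert id, owning GPU rank)."""
    expert = max(range(NUM_EXPERTS), key=lambda e: gate_scores[e])
    return expert, expert_to_gpu[expert]

random.seed(0)
for token_id in range(3):
    scores = [random.random() for _ in range(NUM_EXPERTS)]
    expert, gpu = route_token(scores)
    print(f"token {token_id} -> expert {expert:3d} on GPU rank {gpu:2d}")
```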
The Rise of Agentic AI
Why do we need this speed? If ChatGPT is already fast enough, who cares about 50x? The answer lies in the shift from Chatbots to Agents.
We are moving away from simple Q&A bots toward Agentic AI. An AI agent doesn’t just answer a question; it performs a job. It might reason through a complex legal document, write and debug an entire software module, or plan a travel itinerary by checking live APIs.
This requires two things that kill performance on older hardware:
- Long Context Windows: To fix a bug, an agent needs to read the entire codebase (hundreds of thousands of tokens). Blackwell Ultra offers 1.5x the memory capacity of standard Blackwell (288 GB of HBM3e per GPU) and higher bandwidth to handle these massive inputs.
- System 2 Reasoning: Models are now being designed to “think” before they speak (like the Chain-of-Thought process). This generates thousands of internal “thought tokens” before the user sees a single word.
For an agent to feel responsive, it needs to process these massive contexts and generate internal thoughts at lightning speed. On a Hopper H100, a complex agentic workflow might take minutes, breaking the user’s flow. On a Blackwell Ultra, utilizing the 50x throughput gains, that same workflow happens in seconds. This opens the door for real-time coding assistants that can refactor entire applications while you watch.
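Some illustrative arithmetic makes the point. The token counts and decode speeds below are made up to show the shape of the effect, not measured benchmark results.

```python
# Illustrative only: made-up thought-token budget and decode speeds.
thought_tokens = 20_000  # internal chain-of-thought generated before the answer

for label, tokens_per_second in [("slower system", 60), ("50x faster system", 3_000)]:
    seconds = thought_tokens / tokens_per_second
    print(f"{label}: {seconds:,.0f} s ({seconds / 60:.1f} min) "
          f"to decode {thought_tokens:,} thought tokens")
```

At the slower rate the user waits several minutes before the first visible word; at the faster rate the same internal reasoning finishes in seconds.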
The Economics: The Jensen Math of Tokenomics
The most counterintuitive part of this technology is the cost. A rack of GB300 NVL72s is an incredibly expensive piece of hardware, likely costing millions of dollars. Yet, it reduces the cost of AI.
This is what analysts refer to as “Tokenomics.” If a system costs twice as much but produces 50 times more tokens per hour, the cost per token drops precipitously.
According to data from SemiAnalysis and Nvidia, the GB300 NVL72 delivers up to a 35x lower cost per million tokens compared to the Hopper platform for agentic workloads. This is the difference between an AI service being a money-losing experiment and a profitable business model.
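A tiny worked example, with made-up prices and throughput numbers, shows the mechanics. It does not attempt to reproduce the published 35x figure, which depends on real pricing and measured throughput.

```python
# "Tokenomics" arithmetic with made-up numbers: a rack can cost several times
# more per hour and still be far cheaper per token if its throughput rises
# much faster than its price.

def cost_per_million_tokens(rack_cost_per_hour: float, tokens_per_hour: float) -> float:
    return rack_cost_per_hour / tokens_per_hour * 1_000_000

old = cost_per_million_tokens(rack_cost_per_hour=100.0, tokens_per_hour=50_000_000)
new = cost_per_million_tokens(rack_cost_per_hour=300.0, tokens_per_hour=2_500_000_000)

print(f"older platform: ${old:.2f} per million tokens")
print(f"newer platform: ${new:.2f} per million tokens")
print(f"cost per token improvement: {old / new:.0f}x")
```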
Real-World Impact
We are already seeing this play out with early adopters:
- Sully.ai: An AI “employee” for doctors that handles medical notes. By moving to Blackwell-optimized stacks, they reduced their inference costs by 10x and improved response times by 65%. This allowed them to return over 30 million minutes to physicians—time previously lost to data entry.
- Latitude: The creators of AI Dungeon use massive open-source models to generate infinite storylines. By leveraging the NVFP4 format on Blackwell, they cut their cost per token by 4x, allowing them to offer smarter, more complex narratives without bankrupting the company.
The Competitive Landscape: Where is the Competition?
It is impossible to discuss these gains without looking at the competition. Independent benchmarks from SemiAnalysis show that while competitors like AMD are making strides with chips like the MI355X, they face a composability challenge.
While raw specs on paper might look competitive, the ability to combine all of these advanced techniques (FP4 quantization, disaggregated prefill, and wide expert parallelism) simultaneously is where Nvidia currently dominates. The software maturity of the CUDA ecosystem, combined with the hardware integration of the NVL72 rack, creates a moat that is difficult to cross. For high-end, low-latency agentic workloads, the rack-scale architecture of the GB300 is in a league of its own.
Looking Ahead: The Era of the AI Factory
The Nvidia Blackwell Ultra represents the industrialization of Artificial Intelligence. We are moving from the craftsman era of AI, where models were run on individual servers, to the factory era, where intelligence is generated at massive scale in dedicated, liquid-cooled facilities.
The 50x speed increase is not just a benchmark number; it is an enabler. It enables models that reason deeper, agents that work faster, and applications that were previously too expensive to run. As these systems come online, we can expect a rapid acceleration in the capabilities of the AI tools we use every day. The hardware is no longer the bottleneck; the limit is now our imagination in how to use this abundance of intelligence.