Inside MLPerf v6.0: AI's Diverging Training Stack

A record-diverse round of MLCommons results signals an industry settling on what to train, then splintering over how and where to train it.

MLPerf Training working group co-chair Shriya Rishab sees the field converging on a shared set of best practices for building models. At the same time, she sees the frameworks, silicon, and systems running those models pulling apart into something far more varied than a year ago.

The numbers behind MLPerf Training v6.0, released yesterday by MLCommons, make that tension concrete. The round logged 95 unique systems built on 13 different accelerators and 19 host processors. Sixty percent ran across multiple nodes. Two years ago, in the v4.0 round, multi-node systems were closer to a third of submissions. The default shape of an AI training system has changed, and it now mirrors how real data centers get built.

That breadth is the story this round, more than any single winning time. MLPerf has often been read as an NVIDIA scoreboard. This version reads as a map of genuine plurality: 229 performance results, roughly 1.2 times the prior round, from 24 submitting organizations. Four submitted for the first time, among them Vultr and Korea's TTA. "There are more ways of getting your AI training than ever before," said working group co-chair Pavan Yalamanchili.

The Cloud Moves to the Center

The sharpest shift sits in where training happens. Cloud systems more than doubled against the v5.1 round roughly six months earlier. The independent GPU clouds turned up in force, with CoreWeave, Lambda, Nebius, and Vultr submitting alongside hyperscalers Google, Azure, and Oracle. On-premises build-out has not slowed. What changed is that cloud-hosted training now stands as a credible path rather than a fallback.

Lambda offers one window into the pattern. Its bare-metal GB300 NVL72 run trained Llama 3.1 8B to target in 18.7 percent less time than its previous best, 11.59 minutes against 14.25, and it posted an early result on the new GPT-OSS-20B workload using a single eight-GPU node. The takeaway is not the specific time. It is that a cloud provider kept pace with both the newest hardware and a brand-new workload in the very round that workload debuted.
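The two Lambda times can be read two ways, and the difference matters when comparing vendor claims. A minimal sketch of the arithmetic follows; the function names are illustrative, not anything MLPerf defines. The 18.7 percent figure matches a reduction in time-to-train, while the same pair of numbers expressed as a speedup comes out a little under 23 percent.

```python
def time_reduction_pct(old_minutes: float, new_minutes: float) -> float:
    """Percent less wall-clock time the new run needs versus the old one."""
    return (old_minutes - new_minutes) / old_minutes * 100.0


def speedup_pct(old_minutes: float, new_minutes: float) -> float:
    """Percent more work per unit time, i.e. the throughput gain."""
    return (old_minutes / new_minutes - 1.0) * 100.0


# Times quoted in the article for Llama 3.1 8B, in minutes.
previous_best = 14.25
latest_run = 11.59

reduction = time_reduction_pct(previous_best, latest_run)
speedup = speedup_pct(previous_best, latest_run)
print(f"time reduction: {reduction:.1f}%  speedup: {speedup:.1f}%")
```

Both numbers describe the same result. Which one a submitter quotes is a presentation choice, so it is worth checking which definition sits behind any "percent faster" claim.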

Providers including Nebius and ScitiX go further, arguing their virtualized or standardized environments now perform close to bare metal. That is a claim worth testing rather than taking on faith, and the benchmark is built to let buyers do exactly that.

Convergence on Sparse, Divergence on Precision

Look at the models, and the industry agrees. Two new benchmarks entered the suite this round: DeepSeek V3, at 671 billion parameters with 37 billion active per token, and GPT-OSS-20B, at 21 billion parameters with 3.6 billion active. Both use a Mixture-of-Experts design, which routes each token to a small subset of specialized sub-networks so a large model activates only a fraction of itself per token. The two drew about 22 percent of submissions on debut. Sparse computation is now the shared architecture, and MLCommons is retiring its older dense models, with Llama 3.1 405B and the DLRM-DCNv2 recommender appearing for the last time.
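To make the routing idea concrete, here is a toy sketch of top-k expert routing, along with the active-parameter fractions implied by the figures above. It is an illustration under simplifying assumptions, not the router either model actually uses: the function names, the plain softmax over the selected experts, and the tiny lambda "experts" are all hypothetical.

```python
import math


def active_fraction(total_billion: float, active_billion: float) -> float:
    """Share of a model's parameters used for any one token."""
    return active_billion / total_billion


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, experts, router_logits: list[float], k: int) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Parameter counts quoted in the article, in billions.
print(f"DeepSeek V3 active share: {active_fraction(671, 37):.1%}")
print(f"GPT-OSS-20B active share: {active_fraction(21, 3.6):.1%}")

# Four stand-in experts; a real expert is a feed-forward sub-network.
toy_experts = [lambda v: v, lambda v: 2 * v, lambda v: 3 * v, lambda v: 4 * v]
print(moe_forward(1.0, toy_experts, [0.1, 2.0, -1.0, 2.0], k=2))
```

The point of the sketch is the cost structure: only k experts execute per token, so compute tracks the active parameter count rather than the total.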

Look at the math underneath, and the agreement dissolves. Submitters reached for competing four-bit precision recipes, NVIDIA's proprietary NVFP4 and the open MXFP4 standard, mostly in the dense linear layers of their runs and not yet in the MoE models. Yalamanchili called the spread of FP4 implementations "not surprising," a sign of an industry still exploring what works. Convergence on the model, divergence on the precision. That split is the technical signature of an efficiency race that has not settled.
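For a sense of what a four-bit recipe involves, the following is a simplified sketch of MXFP4-style block quantization, loosely following the open Microscaling format: each block of values shares one power-of-two scale, and each element is stored as a 4-bit E2M1 number. The helper names and the round-to-nearest rule are assumptions made for illustration, and the sketch accepts any block length rather than enforcing the standard's 32-element blocks. NVFP4 differs in its block size and scale encoding and is not modeled here.

```python
import math

# Magnitudes a 4-bit E2M1 element can represent (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest power of two in E2M1 (4.0 == 2**2)


def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (shared power-of-two scale, per-element E2M1 values) for one block."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Choose the scale so the largest element lands near the top of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        scaled = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to the largest magnitude
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - scaled))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale: float, codes: list[float]) -> list[float]:
    """Recover approximate values from the shared scale and element codes."""
    return [scale * c for c in codes]


block = [0.1, -0.7, 3.0, 12.0]
scale, codes = quantize_block(block)
print(scale, codes, dequantize_block(scale, codes))
```

Small values in a block dominated by one large value lose most of their precision, which is one reason the choice of block size and scale format is still being contested.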

New Entrants, More Transparency

Two smaller developments point to where the next rounds go. KRAI benchmarked Isambard AI, the UK's National AI Research Resource, in what it believes is the first appearance of sovereign infrastructure in MLPerf. The submission suggests national compute programs now want public, comparable numbers too. And MLCommons began disclosing the precision and parallelism behind each result, optional this round and mandatory later, so buyers can read past a single headline figure to the choices that produced it.

The models are consolidating. The infrastructure beneath them keeps multiplying. For anyone buying or building AI training capacity, that turns MLPerf from a ranking into a map, with more accelerators, more clouds, and more precision recipes, each a real choice carrying real tradeoffs. The question for the next round is which of these roads widen and which quietly close.
