
MLPerf Training v5.1: GenAI Drives, Scale Ramps, Field Widens
MLCommons’ MLPerf Training v5.1 lands with three clear signals: generative AI continues to shape the benchmark mix, scaling discipline is where many teams are winning time, and the roster of credible submitters is getting broader. This round includes 65 unique systems from 20 organizations spanning silicon, systems, clouds—and, notably, an academic HPC center.
“The view I see in MLPerf is like a Formula One race—same track and rules, room for tuning, and you see who can finish fastest,” said Chuan Li, Chief Scientific Officer at Lambda, who led the company's MLPerf v5.1 efforts. It’s a clever encapsulation of MLPerf’s value proposition: standardized tasks and target quality keep the contest honest while leaving space for technique.
What’s new in v5.1 is squarely aimed at today’s workloads. Llama 3.1 8B replaces BERT for LLM pretraining—a modern, decoder-only architecture that fits on a single node (≤8 accelerators) yet mirrors software patterns used at larger scales. On the image side, Flux.1 replaces Stable Diffusion v2, reflecting the shift to transformer-based diffusion models with cleaner validation via loss. Together, these swaps align the suite with the stacks enterprises are actually deploying.
Momentum and scale show up in the submission patterns. Multi-node entries climbed sharply versus a year ago, and genAI tests drew heavy participation: Llama 3.1 8B debuted with strong interest, while Llama 2-70B LoRA continued to be a favorite fine-tuning proxy. Performance trends outpaced a simple Moore’s Law line again; gains came not just from fresh silicon but from numerics, software, and fabrics—exactly the kind of full-system work buyers need to see.
NVIDIA posted most of the fastest times and largest-scale runs this round—especially on GB200/GB300 NVL72 configurations—reflecting stack maturity and intra-rack NVLink scale. Still, the broader story is ecosystem momentum: new entrants, academic participation, and software/networking gains that turned more multi-node runs into reproducible, closed-division results.
Standout Highlights: New Entrants
• University of Florida (academic first-timer): Submitted results across seven benchmarks on its HiPerGator DGX B200 system, including multi-node scaling to 448 GPUs, demonstrating closed-division reproducibility in a shared HPC environment.
• Wiwynn (platform new entrant): Its Kinabalu system posted Llama 2-70B LoRA results at 72 and 576 GPUs on GB200 NVL72, signaling readiness of its NVLink-centric design for fine-tuning workloads.
• DataCrunch (cloud first-timer): Brought up an 8× B200 Llama 3.1 8B run via Slurm/Pyxis “Instant Clusters,” positioning for fast, reproducible re-runs rather than one-off hero numbers (a hypothetical launch sketch follows this list).
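Slurm with the Pyxis container plugin is a common way to make runs like that repeatable. The sketch below is a hypothetical illustration of the pattern, not DataCrunch’s actual configuration: it assumes a cluster with Pyxis installed, and the container image, mount path, and training entrypoint are placeholder names.

```python
import subprocess

# Hypothetical sketch: launching a containerized single-node training job on a
# Slurm cluster with the Pyxis plugin, which adds --container-image and
# --container-mounts to srun. Image tag, mount path, and entrypoint are
# placeholders, not an actual MLPerf submission setup.
cmd = [
    "srun",
    "--nodes=1",
    "--ntasks-per-node=8",        # one rank per GPU on an 8x B200 node
    "--gpus-per-node=8",
    "--container-image=nvcr.io#nvidia/pytorch:24.09-py3",  # placeholder image
    "--container-mounts=/data:/data",
    "python", "train.py", "--config", "llama31_8b.yaml",   # placeholder entrypoint
]
subprocess.run(cmd, check=True)
```

Because the whole environment rides in the container, the same command can reproduce the run on a fresh allocation, which is the point of an “Instant Clusters” style workflow.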
Precision and practicality deserve a note. Several submitters leaned into lower-precision training (FP8 → FP4 variants) where numerically stable, but MLPerf’s rubric keeps that grounded: time-to-target-quality forces any optimization to actually converge. The other big lever is networking and topology—RDMA over InfiniBand or tuned Ethernet, clean hierarchies, and reliability at scale—because eight nodes only help if they act like eight, not three.
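To make “time-to-target-quality” concrete, here is a minimal, framework-agnostic sketch of the measurement: the clock stops only when the validation metric crosses the benchmark’s target, so a numerics trick that hurts convergence shows up as a slower score rather than a faster one. The train_one_epoch and evaluate callables are hypothetical stand-ins, not the MLPerf reference harness.

```python
import time

def time_to_target(train_one_epoch, evaluate, target_loss, max_epochs=100):
    """Wall-clock seconds until validation loss reaches target, else None.

    train_one_epoch() runs one pass of (possibly lower-precision) training;
    evaluate() returns the current validation loss. Both are hypothetical
    stand-ins for a real training harness.
    """
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() <= target_loss:          # converged: stop the clock
            return time.perf_counter() - start
    return None                                # never converged: no valid score
```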
Lambda: A Rack That Behaves Like One Giant GPU
Lambda was on the short list of submitters to post results on GB300 NVL72 (72 Blackwell Ultra GPUs in a single NVLink domain). Two takeaways surfaced in my side interview with Chuan Li. First, the speed-ups split roughly half-and-half between hardware (more memory, higher inter-GPU bandwidth) and software (driver/library/framework maturation).
Second, numerics helped at the margin: moving from FP8 to an FP4 variant delivered an additional double-digit percentage improvement while still meeting the accuracy target. There’s also a practical lesson here: clean, converged runs at the edge of scale require weeks of lined-up capacity and tight coordination across data center operations, fabric, and software. It’s a useful proof point, though from just one of the round’s 20 submitting organizations, not the whole story.
“We saw a 1.66x speedup in our Llama 2-70B run compared to previous submissions,” Li said. “This performance improvement showcases the power of the latest NVIDIA hardware, combined with Lambda’s cloud orchestration capabilities.”
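For readers wondering what the FP8 path looks like in code, the pattern is typically expressed through NVIDIA’s Transformer Engine rather than plain framework autocast. The sketch below illustrates that pattern under the assumption that Transformer Engine and an FP8-capable GPU are available; it is not Lambda’s submission code, and FP4 recipes are newer and vendor-specific, so only the documented FP8 flow is shown.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Illustrative FP8 forward/backward pass with NVIDIA Transformer Engine.
# Requires an FP8-capable GPU (Hopper or newer); layer sizes are arbitrary.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)      # matmuls execute in FP8; master weights stay higher precision
y.sum().backward()    # gradients flow as usual; optimizer step omitted for brevity
```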
How to Read the Tables Without Getting Lost
If you’re using v5.1 to guide purchasing or platform bets, a few simple rules help:
• Compare like for like. Start within the same benchmark and similar accelerator counts. A 32-GPU result and a 512-GPU result are not interchangeable.
• Look for scale curves, not just a single number. Do you see near-linear improvements from 8 → 16 → 32 → 64 GPUs? That often tells you more than a hero time (see the efficiency sketch after this list).
• Check the software notes. Frameworks, kernels, parallelism strategy, IO/storage, and data pipelines are where much of the delta lives—and MLPerf links to them.
• Use Llama 3.1 8B as a quick stack sanity test. It’s modern, single-node accessible, and a good proxy before you commit to larger spend.
• If you care about image generation, Flux.1 is the new reality. Expect different stress points than SD v2 (attention/memory/diffusion schedule) and plan tuning accordingly.
• Treat FP4 wins as conditional on convergence. The fastest path that misses target quality doesn’t count in MLPerf—and it shouldn’t in production, either.
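One way to apply the “scale curves” rule is to turn published times at different GPU counts into speedup and parallel efficiency. A minimal sketch; the numbers in the example call are invented placeholders, not actual v5.1 results.

```python
def scaling_efficiency(results):
    """Print speedup and parallel efficiency relative to the smallest config.

    results maps GPU count -> time-to-train in minutes.
    """
    base_gpus = min(results)
    base_time = results[base_gpus]
    for gpus in sorted(results):
        speedup = base_time / results[gpus]
        efficiency = speedup / (gpus / base_gpus)   # 1.0 = perfectly linear
        print(f"{gpus:>4} GPUs: {speedup:5.2f}x speedup, {efficiency:6.1%} efficiency")

# Placeholder data for illustration only, not MLPerf results.
scaling_efficiency({8: 120.0, 16: 63.0, 32: 34.0, 64: 19.0})
```

Efficiency that stays near 1.0 out to the largest published count is the scale curve worth paying for; a steep drop-off tells you where the fabric or software starts to bite.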
TechArena Take
This round isn’t just about newer GPUs; it’s about maturing engineering. The two new tests (Llama 3.1 8B and Flux.1) meet the moment for enterprise generative AI, and the influx of first-time submitters expands the set of credible places to run, from platform OEMs to an academic HPC center to nimble clouds.
As organizations push the boundaries of AI infrastructure, advances in hardware, software, and networking are making frontier models more accessible and deployable at scale. MLPerf Training remains a vital yardstick for that progress, keeping it transparent, reproducible, and measurable.