
AI Benchmarks Shift as MLPerf Highlights LLM Dominance

April 2, 2025

Last week, MLCommons dropped a benchmarking bombshell with the release of MLPerf Inference 5.0 — and the implications for the AI infrastructure world are massive. As one of the most trusted benchmarking efforts in machine learning, MLPerf continues to evolve at the pace of the industry it serves. And with this latest round of nearly 20,000 new results, a surge in large language model (LLM) submissions, and several new hardware entrants, the signal is clear: inference is having a moment.

Here’s the TechArena take on what matters most.

LLMs Take the Crown from ResNet-50


In what feels like a milestone moment, the ResNet-50 era is officially over — at least in terms of benchmark popularity. MLPerf Inference 5.0 marks the first time an LLM has overtaken ResNet-50 as the most frequently submitted workload. Specifically, Llama 2 70B now leads the pack, drawing 2.5x the submissions it received a year ago. The benchmarking community — often conservative in adopting new workloads — is fully embracing the age of LLMs.

Why does this matter? Because benchmarks drive optimization, and optimization drives real-world performance. The more representative the benchmarks, the more aligned vendor innovation becomes with enterprise needs.

Introducing the Biggest LLM Benchmark Yet


MLPerf 5.0 introduced Llama 3.1 405B, the largest model ever benchmarked by the organization. And yes, it’s as heavy as it sounds — long context windows, massive parameter counts, and distributed inference across accelerators. The real challenge here isn’t just throughput; it’s meeting tight latency targets while maintaining accuracy.

A few stats that stood out:

  • Median input token length: ~9,500
  • Time to first token (99th percentile): 6 seconds
  • Time per output token: ~175ms (faster than human reading speed)

Translation: These benchmarks aren’t theoretical — they reflect production use cases like RAG (retrieval-augmented generation), agentic AI, and high-performance LLM APIs.
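As a rough back-of-the-envelope check, the per-token figures above translate directly into end-to-end response latency. The constraint values below are the ones cited in this round; the helper function itself is just an illustrative sketch, not part of the MLPerf harness:

```python
def estimate_response_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Estimate end-to-end latency for one streamed response:
    time to first token, plus time-per-output-token for each
    remaining token."""
    return ttft_s + tpot_s * (output_tokens - 1)

# Using the Llama 3.1 405B constraints cited above:
# 6 s time to first token (p99), ~175 ms per output token.
latency = estimate_response_latency(ttft_s=6.0, tpot_s=0.175, output_tokens=500)
print(f"{latency:.1f} s for a 500-token response")  # → 93.3 s for a 500-token response

# Per-stream throughput implied by the per-token budget:
print(f"{1 / 0.175:.1f} tokens/s")  # → 5.7 tokens/s
```

At roughly 5.7 tokens per second, output arrives faster than most people read — which is exactly the bar a production chat or RAG service has to clear.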

It’s Not Just About LLMs


MLPerf 5.0 also brought new benchmarks for Graph Neural Networks (GNNs) and automotive workloads:

  • The GNN benchmark features a relational graph attention network (RGAT) trained on a massive heterogeneous dataset with over 5 billion edges.
  • The new automotive benchmark, based on PointPainting and the Waymo Open Dataset, blends LiDAR and image processing — key for 3D object detection in real-time, safety-critical applications.

Both benchmarks reflect a broader point: inference is everywhere, from cloud AI services to edge deployments in cars and industrial systems.

The Hardware Evolution: FP4, Virtualization, and Liquid Cooling


From AMD’s Instinct MI325X and NVIDIA’s GB200 to Broadcom’s push for virtualized GPUs and Solidigm’s liquid-cooled SSDs, MLPerf 5.0 submissions captured an accelerating hardware shift.

We’re seeing:

  • Adoption of FP4 (4-bit floating point) for LLMs, delivering up to 3x higher performance while meeting tight accuracy requirements
  • Virtualized inference platforms from Broadcom and others that aim to mirror what VMware did for CPUs in the early 2000s
  • Liquid-cooled servers entering the mainstream, as GPUs and SSDs hit thermal thresholds that demand new data center designs

This round wasn’t just about faster silicon — it was about smarter system-level design.
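To make the FP4 point concrete: a 4-bit float can only represent a handful of magnitudes, so weights are rounded to the nearest representable value under a scale factor. The sketch below assumes the E2M1 value set from the OCP Microscaling (MX) formats; the `quantize_fp4` helper is hypothetical and deliberately simplified — real FP4 inference pipelines use per-block scales and hardware rounding modes:

```python
# Representable magnitudes of the FP4 E2M1 format (per the OCP MX spec).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float, scale: float = 1.0) -> float:
    """Round x/scale to the nearest representable FP4 magnitude,
    keep the sign, then rescale. Illustrative sketch only."""
    v = abs(x) / scale
    nearest = min(FP4_E2M1, key=lambda q: abs(q - v))
    return (nearest if x >= 0 else -nearest) * scale

print(quantize_fp4(2.7))             # → 3.0
print(quantize_fp4(-0.8))            # → -1.0
print(quantize_fp4(12.0, scale=2.0)) # → 12.0 (6.0 x scale 2.0)
```

Four bits per weight means a quarter of the memory traffic of FP16 — which is where much of that "up to 3x" performance headroom comes from, provided accuracy survives the rounding.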

The Industry Is All In


Submissions came from 23 organizations: AMD, ASUSTeK, Broadcom, CTuning, Cisco, CoreWeave, Dell, FlexAI, Fujitsu, GATEOverflow, Giga Computing, Google, HPE, Intel, Krai, Lambda, Lenovo, MangoBoost, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Sustainable Metal Cloud. The “open” division also gained traction, giving software-focused companies a chance to shine by showcasing algorithmic and architectural innovations outside strict benchmark constraints.

For data center decision-makers, MLPerf continues to offer a valuable lens for understanding how platforms evolve over time. Submitters now see MLPerf as more than a race — it’s a way to validate software stacks, evaluate scaling strategies, and compare performance under realistic production constraints.

With plans underway to replace older models like GPT-J, expand low-latency LLM scenarios, and broaden the edge inference suite, MLPerf is already planning for version 5.1. And if the growth we saw this round continues, the next wave of results will reflect even more LLM momentum, cross-industry relevance, and workload diversity.  

At TechArena, we’ll continue tracking this fast-moving space and bringing clarity to the deluge of data. Because when benchmarks are this influential, they don’t just reflect the market — they help shape it.

Want to explore the full MLPerf 5.0 dataset? Dig into the Tableau dashboards now via MLCommons.org.
