5 Ways To Keep GPUs Fed—Hard-Won Lessons From Our AI Factory Panel

October 23, 2025

OCP Summit put a spotlight on something I’ve been watching for a while: AI has turned the data center into a living system. The optimization target isn’t a single server box anymore. It’s the confluence of power, liquid cooling, interconnects, storage tiers, and software orchestration, all moving in step to serve wildly dynamic workloads. I was lucky enough to moderate an AI Factory panel last week with CoreWeave’s Jacob Yundt, Dell’s Peter Corbett, NVIDIA’s CJ Newburn, Solidigm’s Alan Bumgarner, and VAST Data’s Glenn Lockwood. These seasoned data center veterans have experience from across the value chain and provided some fantastic insights on how swiftly data center innovation is moving to address customer demand.

Designing for Racks and Multi-Racks

The clearest through-line in the conversation was the industry shift from box-level thinking to rack- and row-scale system design. That change cascades into everything from how you bring in power and manage heat to how you route traffic, where you place data, and how you orchestrate workflows. Jacob framed the pivot crisply, discussing CoreWeave’s move from “thinking of compute as individual units” to racks, rows, and entire data centers, where “everything just needs to work together… power delivery, liquid cooling, networking, storage.”  

Glenn built on this foundation, explaining why VAST sees that not merely as a need for more network connectivity but as a call for application-specific disaggregation. He clarified that rack-scale and scale-up interconnects let you design for how the work runs, not just push everything through a generic L3 network and hope for the best. He also pointed to checkpointing patterns and the pairing of global shared storage with node-local SSDs to maximize performance and economics.
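To make that pattern concrete, here's a minimal sketch of a two-tier checkpoint flow: land the checkpoint on node-local NVMe so the GPUs stall as briefly as possible, then drain it to global shared storage in the background. The paths, file names, and helpers are hypotheticals I've invented for illustration, not any panelist's actual implementation.

```python
import shutil
import threading
from pathlib import Path

# Hypothetical tier locations -- assumptions for illustration, not a real deployment layout.
LOCAL_NVME = Path("/local-nvme/checkpoints")   # fast node-local SSD tier
SHARED_STORE = Path("/shared/checkpoints")     # global shared storage tier

def save_checkpoint(step: int, state: bytes) -> Path:
    """Write the checkpoint to node-local NVMe first, keeping the GPU stall short."""
    LOCAL_NVME.mkdir(parents=True, exist_ok=True)
    local_path = LOCAL_NVME / f"ckpt_{step:08d}.bin"
    local_path.write_bytes(state)
    return local_path

def replicate_async(local_path: Path) -> threading.Thread:
    """Copy the checkpoint to global shared storage in the background for durability."""
    def _copy():
        SHARED_STORE.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_path, SHARED_STORE / local_path.name)
    thread = threading.Thread(target=_copy, daemon=True)
    thread.start()
    return thread

# Usage (hypothetical): checkpoint locally, resume training, let replication
# overlap with compute instead of blocking it.
# local = save_checkpoint(step=1000, state=serialized_model_state)
# replicate_async(local)
```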

“Feed the GPUs”: Idle Time is Lost Opportunity and Unbelievable Cost

In AI data centers, compute utilization, more than ever, defines the business outcome. Peter outlined Dell’s view on the topic, plainly explaining that high utilization comes down to keeping accelerators fed and not recomputing results you already have. Operators need to build the surrounding storage and network capabilities so intermediate results and checkpoints can be reused at speed.

That’s harder than it sounds because the software surface is changing at breakneck pace. CJ put a timestamp on it, reflecting NVIDIA’s torrid pace of innovation: “The only thing I’m really confident in is that it’ll be completely different in three weeks.” He emphasized that orchestration must continually decide where data should live, when to move it, and when remote access beats relocation.

Jacob echoed this sentiment, stating, “The pace of deployment right now is unlike anything I’ve ever seen… the cost of getting it wrong… is so expensive… if your network or storage is not fast enough.” Considering the scale at which CoreWeave is operating, it was clear that guessing wrong here could make or break financial success.

Data Management is Ready for its Close-Up

While it was no surprise that storage was a central discussion point of the panel (after all, Solidigm did host this breakfast), the fervor with which the panelists spoke about storage underscored its architectural importance to this era of the data center. Their message? Stop relegating data plumbing to a downstream team. From metadata and indexing to checkpoint lifecycle and synthetic-data governance, data is foundational to the AI pipeline.

Peter walked through why. Training, inference, RAG, and fine-tuning each have distinct data modes and retention demands. Checkpoints must be instantly restorable and archived for reproducibility, while the generated and simulated data used to train other LLMs needs cataloging at massive scale. He went further, forecasting that “we could see a hundred-x increase” in those volumes.

On the abstraction side, CJ argued for higher-level interfaces that let users express intent, “just go get my data wherever it is,” while the stack decides placement, sharding, staging, and presentation under the hood. That decoupling gives vendors freedom to innovate behind stable APIs as hardware and media evolve.
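As a thought experiment, and emphatically not NVIDIA’s actual API, an intent-style interface might look like a single call that hides placement, staging, and tiering decisions behind a stable surface. Every class and tier name below is a hypothetical made up to illustrate the shape of the abstraction.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataRequest:
    """What the caller wants, expressed as intent rather than location."""
    dataset: str
    latency_budget_ms: Optional[float] = None

class DataPlane:
    """Hypothetical stack-side layer that decides placement, sharding, and staging."""

    def __init__(self, tiers: List[str]):
        # Tiers ordered fastest to slowest, e.g. GPU-local, node NVMe, shared, object store.
        self.tiers = tiers

    def get(self, request: DataRequest) -> bytes:
        # Walk from the fastest tier outward; a real stack would also decide
        # whether to stage the data closer or serve it remotely.
        for tier in self.tiers:
            data = self._lookup(tier, request.dataset)
            if data is not None:
                return data
        raise FileNotFoundError(request.dataset)

    def _lookup(self, tier: str, dataset: str) -> Optional[bytes]:
        # Placeholder: a real implementation would consult catalogs and move data.
        return None

# Usage (hypothetical):
# plane = DataPlane(tiers=["gpu-local", "node-nvme", "shared", "object-store"])
# batch = plane.get(DataRequest(dataset="embeddings/v3", latency_budget_ms=5.0))
```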

And the inside-the-enterprise reality check came from none other than Alan, who shared that after a company-wide AI enablement push, Solidigm discovered very quickly how unclean its data was. His candor was met with many nods around the room and reflected my own discussions with enterprises that carry the same scar tissue. To get the most out of AI, governance and operational control are no longer optional programs – they’re table stakes.

Efficiency is Required Everywhere

With the financial and resource-based investment in AI factories, the focus on efficiency has never been sharper. Liquid cooling has moved from science project to assumed ingredient. Jacob didn’t mince words about how essential this innovation is for companies like CoreWeave. He explained that the next frontier is tying everything together so that when a 20-line command is about to launch and burn tens of megawatts, the power and cooling systems respond proactively, not reactively.

Peter built on this, highlighting why even storage needs thermal rethinking. Multi-kilowatt drive enclosures are real, and standards groups including SNIA and OCP are working on how to cool serviceable drives in high-density racks, because you still need to pull and replace them safely.

Glenn then took the conversation further, uncovering a somewhat silent efficiency killer. He explained that when something fails or stalls, entire clusters idle, and he argued for communities like OCP to push shared semantics for reliability in practice, so teams can identify link flaps, hanging collectives, and other at-scale “ugly bits” quickly rather than burning cycles waiting for the system to realize it has reached an unhealthy state.

One culprit in this equation is how storage granularity can wreck the energy math for certain vector-style workloads. CJ provided an example: reading 16–32KB NAND pages to retrieve ~512 bytes of useful data leads to absurd power budgets. The fix? Coordinated innovation across drives, controllers, firmware, and software to reduce over-fetch and align I/O with what a workload actually needs.
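The back-of-the-envelope math makes the point: at the page sizes CJ cited, the drive moves 32 to 64 bytes for every byte the workload actually wants.

```python
# Read amplification when a ~512 B vector lookup forces a full NAND page read.
USEFUL_BYTES = 512

for page_bytes in (16 * 1024, 32 * 1024):
    amplification = page_bytes / USEFUL_BYTES
    wasted_pct = 100 * (1 - USEFUL_BYTES / page_bytes)
    print(f"{page_bytes // 1024} KB page: {amplification:.0f}x over-fetch, "
          f"{wasted_pct:.1f}% of the bytes (and the energy to move them) wasted")
# 16 KB page: 32x over-fetch, 96.9% wasted
# 32 KB page: 64x over-fetch, 98.4% wasted
```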

Alan added a note on fabric efficiency at scale, explaining that he’s seeing reports of very high utilization on enormous clusters. Those numbers are impressive, but they make the stack hypersensitive to tail latencies and congestion. Getting protocol and software choices right is what keeps utilization sustained, not just peaky.

Standards are Required…and Need to Innovate Themselves

As veterans of the data center industry, the panelists knew well that open standards are an innovation multiplier. Peter put it succinctly, stating that standards have been essential for storage innovation across hardware and protocol interfaces because they let multiple vendors innovate behind them and lower barriers to entry for newcomers.

But when you are moving so quickly, can traditional storage approaches keep up? CJ offered a contrarian view, stating that we shouldn’t rush to standardize designs that limit concurrency or add unnecessary complexity. Instead, we should run experiments, share data, then bring the minimum viable interfaces to standards bodies so we minimize time to useful production without locking in untested ideas.

A 2030 Vision: One Mega-Rack Fueled by Lots of Goop

If you compress the future that our panel described into a single image, it’s this: a single rack built on open chiplet design, consuming 50 megawatts, surrounded by goop, and running elegant software that keeps it fully utilized. That was my summary of the panelists’ prognostications across a five-year innovation horizon.

Jacob kicked us off with a view of “football fields of infrastructure for power and cooling” feeding one mega-rack at ~50 MW. This was not simply a joke but the logical endpoint of density, liquid loops everywhere, and efficiency work that compresses visible compute while everything around it scales out of sight.

Glenn took the idea further, claiming that the “one rack” only works when it’s surrounded by the unglamorous machinery that turns data into answers. He explained that the AI factory of the future will be full of “goop”: the power, cooling, storage, and networking required for compute, with the GPUs at the center actually driving insights.

On software, CJ pushed the same abstraction thread forward. He sees a future where operators declare intent (“go get my data wherever it is”), while the stack optimizes where data lives, how it’s sharded, and how it’s staged/presented across heterogeneous building blocks. That keeps innovation vibrant underneath stable interfaces as media, fabrics, and accelerators evolve.

On hardware, Alan expects innovation to move inside the package: by 2030, we’ll need new standards because we can combine different things inside an SoC in ways that aren’t possible at rack or PCB scale today. That’s the open chip economy showing up in real design choices, including interfaces, subsystems, and thermal/power tech, that collectively ripple up into rack-scale architecture.

And Peter widened the lens: the power envelope itself becomes a macro-catalyst. If the capacity required for AI is built primarily with low-carbon generation, the grid’s evolution can accelerate through economies of scale—changing siting, cooling, and heat-reclaim strategies along the way.  

The TechArena Take

AI isn’t just turning the data center into one big server; it’s turning it into a cohesive and interdependent organism. The amount of collaboration required to execute at this design point is unlike anything we have seen before. Aligning on open, agile interfaces that increase industry velocity will be critical to our collective success. Designing every element to remove inefficiency from the whole is now non-negotiable, and a seat at the table is open for anyone with bold ideas for efficiency breakthroughs.
