
Delivering AI Performance at Scale with CoreWeave

Data Center
Allyson Klein
April 15, 2024

I recently attended NVIDIA GTC, called by some the Woodstock moment of the AI era, and I’m still unpacking what we learned there about industry innovation to fuel AI workloads. While the TechArena packed in as many conversations as possible with industry innovators at the event, one conversation stood above the rest: our interview with CoreWeave’s Jacob Yundt. He leads infrastructure buildout for CoreWeave as the company charts a trajectory for delivering unparalleled scale for AI training in the cloud.

How did they do it? As we have seen at many inflection points, CoreWeave took advantage of not being encumbered by legacy to deliver a cloud stack purpose-built for AI training clusters, from initial provisioning to health checks, orchestration, and scheduling. This enables the company to bring a staggering number of GPUs to bear on a given training task at warp speed while providing reliable compute throughout the training period. CoreWeave proactively monitors its instances to ensure that precious training cycles are not disrupted by hardware failures, I/O issues, or the other maladies that confront data center infrastructure.

CoreWeave has developed a cult-like following among AI startups training models, where speed to train is often the difference in capturing a market opportunity. Jacob clarified that their market focus is on any customer looking to do “ground-breaking work at incredible scale,” which speaks to the underlying infrastructure requirements they face across compute, storage, and network. And the demand for this infrastructure is stark. CoreWeave has been on record stating that power demand alone from its training clusters may stress local power grids in the communities where it operates, and demand for CoreWeave itself is growing exponentially: valued at $7B last December, the company’s reported valuation had surged to $16B just four months later, underscoring the growth potential of AI training.

So what infrastructure is CoreWeave tapping to deliver its AI service? It’s no secret that its training relies on NVIDIA GPUs, and CoreWeave will be integrating next-generation Blackwell GPUs into clusters utilizing liquid cooling technologies. But Jacob stressed that more than GPUs goes into the groundbreaking scale they’ve been able to achieve. That scale starts with re-imagining the data pipeline, and CoreWeave has leaned into a strategic partnership with VAST Data to deliver innovative data management and control that scales with GPU performance needs. VAST Data’s platform has delivered new capabilities for managing data sets, bringing data to the processing complex more efficiently and quickly while eliminating much of the overhead associated with traditional tiered storage solutions.

Jacob stated that the collaboration with VAST Data begins with his team’s love of QLC storage and the careful balance of performance, capacity, and efficiency that QLC delivers. To say that Jacob is a fan of QLC is an understatement, and it’s no surprise given QLC’s advantage over TLC technology in delivering increased data density per cell. Jacob noted that his long-standing collaboration with Solidigm has ensured QLC deployment in his data centers, with a partnership that extends beyond procurement to account and engineering support. When you consider the size of the LLMs being trained at CoreWeave, it’s easy to conclude that’s a lot of QLC NAND being deployed.

So what’s next for CoreWeave? Watch this space to learn more about its continued infrastructure buildout as a harbinger of broader AI market adoption. I’m also interested to see whether CoreWeave can make a dent in the cloud service provider landscape with its built-for-AI-training stack. And I’ll be reporting on advances across the data pipeline infrastructure industry, including in my Data Insights series with Solidigm.
