An Industry Legend Offers His View of Cooling for the AI Era
Andy Bechtolsheim is a legend of the computing industry. Currently at Arista Networks, Andy is arguably best known for founding Sun Microsystems, but he also founded Granite Systems, a Gigabit Ethernet startup acquired by Cisco, and of course Arista. He was also a founding board member of the OCP organization. He is known as a savvy Silicon Valley investor as well, including being one of the earliest investors in Google, funding Larry and Sergey before they had even incorporated. Depending on who you want to believe, he also encouraged them to name the company Google. He has received the Smithsonian Leadership Award for Innovation, is an elected member of the National Academy of Engineering, and was recognized by IT Pros as the person who has delivered the most to server innovation over the past 20 years. He has a knack for understanding deep technology as well as being on the cusp of what’s next, which is why I prioritized his talk on the AI Data Center at OCP Summit.
Andy started his talk today identifying that generative AI performance requirements are growing 10X per year. Where does a 100X performance breakthrough come from over the next few years? Andy called out architectural improvements, next-gen processor advancements, optimized number representations including microscaling formats, and higher memory bandwidth among the key elements for the industry to focus on. Process technology alone will not get us there as the industry moves from 5nm to 3nm and 2nm nodes. 2.5D and 3D packaging technologies will help, as available silicon real estate will basically double. Together, these represent about a 5X performance optimization…not enough.
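To make the microscaling idea concrete, here is a minimal sketch of block-wise quantization with a single shared power-of-two scale, which is the core mechanism behind microscaling formats. Note this is an illustration, not the spec: real OCP MX formats use fixed blocks of 32 elements, an 8-bit shared scale, and FP8/FP6/FP4 element encodings, whereas this sketch uses int8 elements for simplicity.

```python
import numpy as np

def mx_quantize(block, elem_bits=8):
    """Quantize a block of floats with one shared power-of-two scale
    (the core idea behind microscaling formats). Returns the quantized
    integer elements and the shared scale exponent."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros(block.shape, dtype=np.int8), 0
    qmax = 2 ** (elem_bits - 1) - 1  # e.g. 127 for 8-bit elements
    # Smallest power-of-two scale that fits the largest element in range.
    scale_exp = int(np.ceil(np.log2(max_abs / qmax)))
    scale = 2.0 ** scale_exp
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    return q, scale_exp

def mx_dequantize(q, scale_exp):
    """Recover approximate float values from elements and shared exponent."""
    return q.astype(np.float32) * (2.0 ** scale_exp)

vals = np.array([0.12, -0.5, 0.03, 0.25], dtype=np.float32)
q, e = mx_quantize(vals)
approx = mx_dequantize(q, e)
```

The payoff is storage: one shared scale per block amortizes the exponent cost across many narrow elements, which is why these formats can cut memory bandwidth per operand.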
Andy suggested that doubling power and improving cooling will be necessary to drive the performance required. He laid out a vision for a 3000W GPU chip, representing historic challenges for cooling technologies. He introduced a focus on diamond substrate technology (DST), a roughly 5X better thermal conductor than copper, leading to a reduction of hot spots on a die and more effective area over which liquid cooling technologies can work. So what cooling is required? Liquid cooled data centers are the new design focus due to AI. Andy called for a holistic design approach from chip to rack to building.
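A quick back-of-the-envelope via Fourier's law shows why that 5X conductivity matters at 3000W. The die size and spreader thickness below are my own illustrative assumptions, not figures from the talk; the conductivities are textbook ballpark values (copper ~400 W/m·K, diamond ~2000 W/m·K).

```python
def conduction_delta_t(power_w, thickness_m, k_w_per_mk, area_m2):
    """Temperature drop across a heat spreader via Fourier's law:
    dT = P * L / (k * A)."""
    return power_w * thickness_m / (k_w_per_mk * area_m2)

# Hypothetical 3000 W die under a 30 mm x 30 mm, 2 mm thick spreader.
P, L, A = 3000.0, 2e-3, 30e-3 * 30e-3
dt_copper = conduction_delta_t(P, L, 400.0, A)    # copper, k ~ 400 W/m-K
dt_diamond = conduction_delta_t(P, L, 2000.0, A)  # diamond, k ~ 2000 W/m-K
```

With these assumptions the copper spreader drops roughly 17 K while diamond drops about 3 K; a 5X conductor buys 5X less temperature rise through the spreader, which is exactly the headroom hot spots eat up.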
He introduced the open AI rack GPU/CPU configuration that delivers 100-200 kW per rack, supporting 16 2U GPU blades with supporting switch and power shelves. The practical limit for air cooling, by contrast, is about 50 kW using heat exchanger technology. The fabric inside the rack utilizes passive copper cables, offering the lowest power and high reliability. Andy believes that this rack technology is the future for AI clusters and called for the industry to standardize this specification so that engineering development is not fragmented across redundant solutions.
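Rough arithmetic shows how 16 blades land in that 100-200 kW envelope. The per-blade and shelf wattages below are illustrative assumptions for the sketch, not numbers from the talk.

```python
def rack_power_kw(n_blades, blade_kw, switch_kw=2.0, power_shelf_kw=2.0):
    """Simple rack power budget: GPU blades plus shared switch and
    power shelves (shelf figures here are assumed, not from the talk)."""
    return n_blades * blade_kw + switch_kw + power_shelf_kw

# ~6 kW per blade lands near the low end of the range;
# ~12 kW per blade pushes toward the high end.
low = rack_power_kw(16, 6.0)
high = rack_power_kw(16, 12.0)
air_cooling_limit_kw = 50.0  # practical air-cooled ceiling cited in the talk
```

Either scenario is 2-4X past the practical air-cooling ceiling, which is the whole argument for liquid in one line of arithmetic.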
But what if folks want to go further, to 200 kW? And they do…HPE/Cray has a 200 kW system on the market today as one example of industry delivery at higher power. For this, immersion cooled racks will rule the day. All heat in this case is removed via the immersion liquid. Here, the industry is still not settled on the liquid to be used among water, mineral oil, and Fluorinert. Water would be preferred, but there are challenges with corrosion, bacterial growth, and the ongoing water chemistry maintenance required. Two-phase cooling using alternative liquids can avoid pumps and ongoing maintenance for antibacterial or anticorrosion treatment. What does this mean for IT operators? Andy called for more standards to come into play, as the industry is limited to bespoke solutions that will not achieve the scale required for our impending AI era.
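The single-phase versus two-phase trade comes down to sensible heat against latent heat, which a couple of lines capture. The 10 K temperature rise and the ~100 kJ/kg latent heat for a dielectric fluid are assumed ballpark values for illustration; water's specific heat is the standard 4186 J/kg·K.

```python
def single_phase_flow_kg_s(heat_w, cp_j_per_kgk, delta_t_k):
    """Mass flow to remove heat by sensible heating: m = Q / (cp * dT)."""
    return heat_w / (cp_j_per_kgk * delta_t_k)

def two_phase_flow_kg_s(heat_w, h_fg_j_per_kg):
    """Mass flow to remove heat by vaporization: m = Q / h_fg."""
    return heat_w / h_fg_j_per_kg

Q = 200e3  # a 200 kW rack
m_water = single_phase_flow_kg_s(Q, 4186.0, 10.0)  # water, 10 K rise assumed
m_dielectric = two_phase_flow_kg_s(Q, 100e3)       # assumed h_fg ~ 100 kJ/kg
```

Boiling absorbs the full latent heat at a constant temperature, which is how two-phase systems can passively circulate via buoyancy and skip pumps entirely, at the cost of settling on a suitable (and maintainable) working fluid.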
The TechArena’s take from this rapid-fire talk at OCP Summit: the fact that Andy focused on something as esoteric as liquid cooling underscores the pending challenge of performance scaling for AI. The ongoing death of Moore’s Law is driving up energy draw and heat from server silicon, while the processing requirements of AI are growing at unprecedented rates. Expect to see rapid innovation in the liquid cooling arena, and look for OCP’s sustainability initiative to take a leading role in standardization efforts here.