
TechArena host Allyson Klein chats with Digital Sunshine’s Gina Rosenthal about how AI is reshaping marketing as the two preview Gestalt IT’s upcoming AI Field Day 4.

TechArena host Allyson Klein chats with Alphawave Semi’s VP of marketing and product management Letizia Giuliano about the progress the industry has made in the past year toward an open market for chiplet innovation and how her company is positioned to thrive as chiplet solutions ready for market.

I attended Juniper Networks’ analyst and influencer call a few weeks ago, in advance of today’s announcement from the company, to learn more about Juniper’s strategy in the year ahead. You may recall that Juniper was featured at the last Cloud Field Day event, where we got a deep dive into their Apstra solution. Today, as a surprise to no one, Juniper wanted to talk about AI. In fact, they doubled down on the AI era with the introduction of their AI-Native Networking Platform, claiming it as the industry’s first AI-native solution.
There is absolute truth to the idea that network capability is critical to maximizing organizational return in this brave new world of computing. Applying AI across the network to collect data on the full network experience gives network operations teams new tools for managing and optimizing the network, and that’s reason enough to make Juniper’s move here quite interesting. When you look at the claimed efficiency improvements, including elimination of 90% of network trouble tickets, 85% fewer IT onsite visits, and network issue resolution time cut in half, you realize that this is not just an interesting academic step into AI but a potential game changer for network operations management as a whole.
How do they plan to pull it off? Juniper is targeting leadership with a common cloud-native, microservices-based AIOps platform for the network across data center, campus, and branch, with wider automated WAN support coming in the near future. Juniper’s CEO Rami Rahim was quick to point out that the AIOps platform was over seven years in the making, forming a major market advantage versus other network providers. The team went further to discuss deep partner collaborations to bring this core capability to market with the support required to enable swift deployments that work from day one. A highlight of the offering is the introduction of the new Marvis Minis, which automate problem resolution in real time without users needing to be engaged. They coupled this with Marvis VNA for the data center, delivering a ChatGPT-like conversational interface to drive performance of critical applications and ensure improved customer experience. These tools make the claims mentioned above seem attainable when you consider what they provide network ops teams versus currently available solutions.
Will Juniper be successful in this differentiation? Questions from the analysts and influencers in the session focused on the depth of capability in the new solution offerings that integrate automation and deeper awareness into the network. The group drilled down into the topic of WAN, as there was no mention of specific AI tools for the WAN environment in our briefing. Juniper acknowledged that the WAN presents broader complexities for integrating these types of tools, but they called out the microservices-based common cloud as key to the road forward.
Ethernet fabrics also came up given the importance of this technology for AI compute clusters. This is an area where Cisco has been leading, and Juniper did not have a complete answer when I asked about their strategy at Cloud Field Day. I think this is a space to watch from Juniper moving forward: in their response they underscored the importance of advancing the technology to meet specific AI requirements and were more front-footed in signaling that more details would follow.
What is the TechArena take on Juniper’s direction? I love the lean into AI as a strategic element of automating the network end to end. I appreciated the honesty about the complexity of the WAN space and expect to hear more about their engagement in WAN in the months ahead. The same holds true for data center fabrics. I expect this to be an arena of massive industry advancement as Ultra Ethernet standards arrive to compete with InfiniBand at the advanced edge of capability, and Juniper in my mind will have a role to play in delivering fabric innovation. Overall, I’m intrigued by Juniper’s announcement today as it underscores a symbiotic relationship between the wide-reaching impact of AI’s transformation of all computing functions, including network automation, and the importance of network innovation to the future of AI. I do expect other network providers to swiftly follow suit. After all, they too have been privy to the endless industry chatter on AI advancement that was the zeitgeist of 2023. But today is Juniper Networks’ day in the sun. Kudos to the team for pulling off this leadership announcement. The TechArena team can’t wait to hear more.

Voltron Data was recently featured on the TechArena discussing composable data system delivery with their Theseus solution. I was ridiculously intrigued both by Josh Patterson’s team’s approach to solving a really challenging aspect of data analysis, accelerating data preprocessing at scale, and by the organization’s deep background in GPU architectures and accelerated system design. This background has enabled Theseus to deliver an accelerator-agnostic framework that not only taps the power of GPUs but enables a host of logic accelerators to work in tandem with high-bandwidth memory and fabric technologies to deliver the performance needed as efficiently as possible.
Enter Claypot AI, the company founded by Chip Huyen and Zhenzhong Xu. Claypot is a real-time AI solution that will help infuse Theseus with MLOps capability and extend Voltron Data’s offerings to other industry platforms like Apache Arrow, Ibis, and Substrait. The technical chops of the Claypot team are impressive. In fact, Chip wrote “Designing Machine Learning Systems” and teaches the subject at Stanford. She also has a history at NVIDIA and will bring this background to Theseus development and to Voltron Data more broadly.
I can’t wait to hear more from this new collective team as they tackle one of the most daunting challenges of broad AI adoption: efficient delivery of performance across the entire workflow. Watch this space for more!

TechArena host Allyson Klein chats with Unravel Data CEO Kunal Agarwal about how his organization is tapping AI to disrupt the data observability arena.

TechArena host Allyson Klein chats with Voltron Data CEO Josh Patterson about the delivery of Theseus, a composable data system framework that unleashes developers with new interoperability and flexibility for AI-era data challenges.

TechArena host Allyson Klein chats with Ampere Chief Product Officer Jeff Wittich about his company’s progress in winning data center deployments, advances in performance and sustainability, and pushing the limits on core density.

TechArena host Allyson Klein chats with Fortinet security experts Srija Allam and Julian Petersohn about the expansive Fortinet solution portfolio and how the company is leaning into AI to help deliver the protection customers require.

TechArena host Allyson Klein sits down with the marketing co-chairs of the CXL Consortium at SC’23 to discuss the introduction of the new 3.1 spec, the emergence of true CXL 2.0 solutions, and what comes next from this disruptive standard that will redefine data center infrastructure.

I was delighted to catch up with Arm’s Eddie Ramirez at last week’s Open Compute Project Summit and learn how he sees Arm growing its presence in the data center. I’ve been following Arm’s rise for years, first when I was responsible for tracking competitive technologies at Intel and later in delivering some of my first TechArena conversations with Ampere. The value proposition for Arm platforms has always been intriguing given the architecture’s inherent advantages in energy efficiency. With power management becoming a much more urgent concern for operators amid rising energy prices and a greater focus on corporate sustainability, Arm has taken off with cloud service providers. Interestingly, earlier this year we also saw Arm gain great enterprise traction with announcements like Oracle database and SAP HANA support for the architecture.
When I spoke to Eddie, he emphasized the value of the architecture but then went further and pointed to the expansion of heterogeneous computing as an important opportunity for Arm. Arm makes a perfect option for companies looking to deploy accelerator rich platforms as well as a foundation for chiplet configurations offering integrated acceleration, and Eddie noted that he’s hearing about demand in this space and pointed to Meta’s OCP talk as an example.
“Meta said, ‘there isn't a one size fits all solution for AI. For some models, we're going to need a lot of memory. In other models, we're going to need more compute. The space of a one size fits all AI hardware is very difficult to try to crack.’ We love that because at Arm we can give partners the ability to design customized silicon for whatever use case they want to optimize around. These AI accelerators are a perfect example of a broader trend around heterogeneous computing, whether it's in a chip interface, if it's multiple chips on a board or multiple solutions in a rack.”
This observation from Arm is spot on and reflective of a broad industry trend toward platform customization to address unique workload requirements. Pulling this off, however, requires enabling an ecosystem of chip suppliers, foundries, and connected technologies to ensure the platforms delivered work easily for the customer. Enter Arm Total Design, a first-of-its-kind ecosystem that aims to accelerate collaborative design of IP, cores, and custom acceleration for bespoke silicon delivery. There are already thirteen vendors participating in the program, with more expected in the months ahead.
As someone who has managed ecosystem programs for silicon innovation in the past, I find news of this effort incredibly interesting: it shows that Arm is serious about leading an entire industry forward based on its architecture, and that interest in the technology is real enough for vendors to prioritize engineering cycles for Arm-based innovation. I can’t wait to see how this initiative reaps results in the months ahead through customer solution deployments and growing business opportunity for all involved.

TechArena host Allyson Klein sat down with AMD’s Robert Hormuth and Prabhu Jayanna at the Open Compute Summit to discuss the advancements AMD EPYC processors have delivered in performance, performance efficiency, and security capability, including AMD’s Infinity Guard technology.

TechArena spoke to over 20 industry experts from AMD, Ampere, Arm, Cloudflare, Etho Capital, Fermyon, HED, Intel, Koomey Research, Lemurian Labs, Meta, Microsoft, the Open Compute Project, Oracle, Schneider Electric, Solidigm, and Tenstorrent to publish this comprehensive report on the state of compute sustainability and how organizations should align data center planning and oversight with corporate social responsibility objectives. If you manage an IT organization or oversee data center infrastructure, software, or building management, this report offers practical value for your organization.

TechArena host Allyson Klein chats with Arm’s Eddie Ramirez at the Open Compute Summit about the architecture’s progress in data center, growing a thriving ecosystem, and the sustainability advantages of Arm’s design making it even more attractive to data center operators.

TechArena host Allyson Klein chats with Solidigm’s Roger Corell and Tahmid Rahman at the OCP Summit about their company’s heritage in the storage arena and how their SSD portfolio delivers the performance and efficiency required for the AI era.

This summer, the Ultra Ethernet Consortium (UEC) was formed by founding members AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft. The charter of this organization is to address gaps in Ethernet, primarily for AI cluster requirements. You may wonder why a technology founded decades ago requires this work, but if you look back at the history of high-speed fabrics you’ll remember we’ve been here before. InfiniBand entered the scene in the early 2000s, providing mainframe-inspired RDMA at high speeds. Soon, the Top500 supercomputing rankings started featuring the interconnect. Fast forward to the AI era, the convergence of HPC architecture with AI clusters, and NVIDIA’s acquisition of InfiniBand provider Mellanox, and you can see that AI cluster operators today are limited in the scale of cluster deployment without a pricey, single-vendor solution. Cloud providers want to scale more efficiently, and UEC is looking to solve the final limitations that previous efforts have not addressed.
As someone who worked on InfiniBand during the founding of its specifications, I’ve followed this dramatic narrative more than the average data center geek. While the companies gathered to work on UEC specifications provide great leadership across the value chain, and others from the Ethernet industry have also signaled their support for the initiative, there has been some concern that the religious debates that spring up around communications protocols could once again limit progress.
This week, however, UEC advanced further with a strategic collaboration with the Open Compute Project and the collective data center operators that roam OCP working groups and deploy open OCP-spec’d hardware. The two organizations announced a sweeping collaboration across the OCP Switch Abstraction Interface (SAI), OCP Caliptra Workstream, OCP Networking Project, OCP NIC Workstream, OCP Time Appliance Project, and OCP Future Technologies Initiative. I’d also assume that OCP’s sustainability initiative will get involved to ensure the next-generation industry-standard fabrics that stem from UEC offer sustainability as well as performance.
When I caught up with Cloudflare’s Rebecca Weekly earlier this week, she put this announcement in context. “We need massive clusters for AI models. I think one of the big themes here is around ultra ethernet, the consortium that has just been established in the last few months to really start to drive ethernet as primary. That is a huge step forward. I would argue for the ecosystem to be able to drive interconnected systems to get us from a couple hundred accelerated nodes operating together to actually millions of accelerated nodes operating together. That is a connectivity first problem.”
The TechArena take: we’re excited to see this collaboration as it will accelerate the time from specification to solutions and hopefully prevent vendor solutions from forking off the standards. We also see this as a widening of cloud operator blessing on UEC, signaling interest in using Ethernet for massive-scale fabrics. This puts UEC on our list of technologies to watch heading into 2024.

I was excited for the WEKA presentation at Cloud Field Day, and the WEKIES did not disappoint. Launched in 2017 at re:Invent, WEKA provides data services aimed at HPC cloud services, whether in the public cloud or within an on-prem private cloud. Their customers are scientists exploring everything from drug discovery and genome sequencing to advances in autonomous transportation and electronic design automation. And yes, their solutions extend to AI, as many of the infrastructure principles from HPC have formed the foundation for tightly coupled AI clusters.
When you consider optimization of an AI data pipeline, each stage of the pipeline introduces different challenges for data storage. WEKA has unified this optimization through a single software approach for each stage, further optimized for the different IO characteristics across AI models.
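To make those stage-by-stage IO differences concrete, here is a minimal illustrative sketch; the stage names and access patterns are generic industry assumptions of my own, not WEKA’s taxonomy or design:

```python
# Illustrative only: generic IO profiles across an AI data pipeline.
# Stage names and characteristics are common industry assumptions,
# not a description of WEKA's internal design.
PIPELINE_IO_PROFILES = {
    "ingest":        {"pattern": "sequential write", "io_size": "large", "sensitivity": "throughput"},
    "preprocess":    {"pattern": "mixed read/write", "io_size": "small", "sensitivity": "metadata ops"},
    "training":      {"pattern": "random read",      "io_size": "small", "sensitivity": "latency"},
    "checkpointing": {"pattern": "burst write",      "io_size": "large", "sensitivity": "throughput"},
    "inference":     {"pattern": "random read",      "io_size": "small", "sensitivity": "latency"},
}

def storage_challenge(stage: str) -> str:
    """Summarize why one storage configuration rarely fits every stage."""
    p = PIPELINE_IO_PROFILES[stage]
    return f"{stage}: {p['pattern']} ({p['io_size']} IO), bottlenecked by {p['sensitivity']}"

for stage in PIPELINE_IO_PROFILES:
    print(storage_challenge(stage))
```

The point of the sketch is simply that no single tuning satisfies all five rows, which is why a unified software layer that adapts per stage is compelling.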

WEKA discusses super efficient HPC storage optimization
WEKA runs across the four major cloud players, and the performance is eye-opening: 5 million IOPS in AWS, 120 fps rendering from the cloud, 40X faster model deployment, 2 TB/s of throughput in OCI, and 10X faster research. This performance also yields cost savings through the reduction of duplicate copies and pay-for-what-you-use storage. WEKA estimates these efficiencies have saved its customers a collective 260 million tons of carbon emissions.
They deliver this in part by modernizing storage tiering: keeping metadata in the flash tier, chopping large files into small objects, and packing tiny files into larger objects. They also pre-stage data on the SSD while waiting for the current process to complete. Object storage is kept on hotter-tier SSDs until the space is needed, lowering holistic latency and improving application performance. They walked through an S3 example but apply similar approaches across public offerings.
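As a rough illustration of the chop-and-pack idea, here is a simplified Python sketch; the 8 MiB object size and the logic are my own assumptions for illustration, not WEKA’s implementation:

```python
# Illustrative sketch of the chop-and-pack tiering idea described above:
# large files are chopped into fixed-size objects, tiny files are packed
# together into larger objects. A simplification, not WEKA code.
OBJECT_SIZE = 8 * 1024 * 1024  # assumed 8 MiB object size for illustration

def chop(file_name: str, file_size: int):
    """Split a large file into fixed-size object extents."""
    objects, offset = [], 0
    while offset < file_size:
        length = min(OBJECT_SIZE, file_size - offset)
        objects.append((file_name, offset, length))
        offset += length
    return objects

def pack(small_files: list[tuple[str, int]]):
    """Pack many tiny files into shared objects up to OBJECT_SIZE each."""
    packed, current, used = [], [], 0
    for name, size in small_files:
        if used + size > OBJECT_SIZE and current:
            packed.append(current)
            current, used = [], 0
        current.append((name, size))
        used += size
    if current:
        packed.append(current)
    return packed

# Example: a 20 MiB file becomes three objects; ten 1 KiB files share one.
print(len(chop("model.ckpt", 20 * 1024 * 1024)))          # -> 3
print(len(pack([(f"f{i}", 1024) for i in range(10)])))    # -> 1
```

Either way, the object store sees uniformly sized objects, which is what makes the tiering and pre-staging behavior predictable.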
WEKA also drives data portability within hybrid and multi-cloud environments. They do this by pulling data from on-prem, snapping it to object storage, encrypting the data, and moving it to the public cloud. The same mechanism supports bursting, where analysis is delivered in the cloud and data is then moved back on-prem so that public services can be spun down.
The WEKA team also discussed zero-footprint storage: running applications and WEKA on the same servers to deliver cost savings and value at scale to the customer. This is employed within GPU farms, where IO resources are under-tapped, and on other hardware with excess IO capacity.
TechArena’s take: cloud providers are desperate to drive AI clusters to higher levels of performance and efficiency. That was the #1 topic at OCP this week, and it’s no surprise that all of the big players have integrated WEKA into their offerings. With more organizations moving HPC workloads to the cloud and more enterprises expanding their AI workloads in the cloud when they can get the GPU cycles, WEKA has an opportunity for major growth in the days ahead. I’m most intrigued by the converged mode zero footprint solution as an innovative path to increased resource and cost efficiency. I’ll be watching this space.

AMD CVP Robert Hormuth was on hand at Cloud Field Day to share his views on cloud-native computing and how his architecture team at AMD has approached EPYC processor development to deliver the right performance for cloud-native environments. Why would architecture changes matter for an operational computing model? Robert laid out his case by walking through the shift from traditional monolithic computing to virtualized/IaaS, containerized/CaaS, and functionized/FaaS workload management. Each of these stack changes drove underlying changes in processor architecture. The virtualization era drove higher memory capacity, a move to multicore architectures, and some intra-VM data locality. The move to containers brought higher core counts and increased memory bandwidth, a great reduction in data locality, and more stress on IO performance. Finally, functions have called for maximizing core count and handling short-lived workloads.
What has AMD done to address this? First, AMD prioritized innovating for the future of cloud-native workloads over porting the past. With this comes a shift away from VMs sharing core hypervisor services across all platform services: Dom0 services instead run in a single VM behind a thinner hypervisor and are offloaded to a DPU or SmartNIC. The AMD DPU, called Pensando, integrates high-speed packet processing, CPU slow-path processing, and backside IO to flash, NVMe drives, DRAM, and more. This buys you more tenant instances, more applications enabled in the cloud thanks to determinism and performance, consistent management across environments, centralized security, and abstraction of hardware and software services to any host CPU and bare-metal OS.
This also comes with a focus on chiplet architectures and accelerating innovation past Moore’s Law. AMD’s chiplet design enables product innovation by mixing and matching cores and architectural features for different workload requirements, as represented by the four different EPYC product lines within the 4th generation of the processor. Chiplets also lower design cost with incredible flexibility compared to monolithic designs. Chiplets with 2.5D and 3D stacking also increase silicon real estate, delivering higher silicon area per package and ultimately more cores per CPU and more memory to the customer.
So what about CPU acceleration? AMD strongly believes that tenants will want to run off-the-shelf code that has not been optimized for workload acceleration, and that operators benefit not from offering tenants workload acceleration but from accelerating core functions through DPUs while providing high-performance standard cores for tenant workloads. This is an interesting view given that AMD’s primary competitor has loaded their CPU with 14 workload accelerators.
The TechArena’s take is that both companies are right. Some savvy enterprises will optimize their code to take advantage of workload acceleration, and some of them will run these strategic apps in the cloud. The majority of enterprises do indeed want to run off-the-shelf code, and AMD’s bet on simplified high-performance cores will pay off over alternatives. Cloud operators will likely continue offering a choice of infrastructure instances to address both customer markets. AMD Bergamo offers 128-core scaling that puts it in the lead for pure performance, and because of this, cloud-native workloads, especially in containerized or functionized environments, are ideally suited for these platforms.

TechArena host Allyson Klein chats with Cloudflare VP Rebecca Weekly at the OCP Summit on far-ranging topics including the demands of AI on infrastructure, how Summit announcements will shape the industry, and the importance of modular and circular infrastructure oversight.

TechArena host Allyson Klein talks to Intel’s Eric Dahlen, Microsoft’s Shruti Sethi and Schneider Electric’s Alex Rakow about the advancement of OCP’s sustainability initiative in 2023 across embodied carbon in silicon, circularity, and getting beyond PUE.

At Cloud Field Day today, the team from Juniper Networks delivered their vision for how they’re assembling the right technology, partnerships, and embrace of industry standards for AI-era data center networking requirements. Their discussion spanned operations, openness, and solutions and was very much rooted in the company’s recent acquisition of Apstra, network automation software that aims to simplify and scale network management for IT operators.
Juniper detailed how it sees cloud-style on-prem networking. Chris Magret laid out the complexity of managing across the alphabet soup of network solutions and made a case for cross-vendor oversight with Apstra and Terraform. Chris walked through a demo of configuring a small LAN of Juniper switches, showing how to establish an Apstra routing zone, services network, and VLAN: he educated Apstra on the network topology, let Apstra automatically discover network resources, including spine and leaf switches as well as server nodes, into a blueprint, and then automated the configuration of all switches in the network. Chris went further by deploying an application into the fabric using Apstra. The key point of this exercise was demonstrating a cloud-native management interface that saves IT complexity, time, and the skill level required for network oversight.
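To give a feel for what that cloud-native, API-driven workflow looks like, here is a hedged Python sketch against Apstra’s REST API. The demo itself used Terraform, whose Apstra provider wraps this same API; the endpoint paths and payload fields below are assumptions from my recollection and may differ across Apstra versions, so treat them as hypothetical and consult the API docs:

```python
# Hypothetical sketch of driving Apstra's REST API from Python.
# Endpoint paths and payload fields are illustrative assumptions and
# may not match your Apstra version; check the official API reference.
import requests

APSTRA = "https://apstra.example.com"  # hypothetical controller address

def login(user: str, password: str) -> str:
    """Authenticate and return an API token (assumed login endpoint)."""
    r = requests.post(f"{APSTRA}/api/user/login",
                      json={"username": user, "password": password},
                      verify=False)  # lab only: disables TLS verification
    r.raise_for_status()
    return r.json()["token"]

def create_blueprint(token: str, template_id: str, label: str) -> str:
    """Instantiate a fabric blueprint from a template (assumed fields)."""
    r = requests.post(f"{APSTRA}/api/blueprints",
                      headers={"AuthToken": token},
                      json={"design": "two_stage_l3clos",
                            "init_type": "template_reference",
                            "template_id": template_id,
                            "label": label},
                      verify=False)
    r.raise_for_status()
    return r.json()["id"]

tok = login("admin", "password")  # placeholder credentials
bp = create_blueprint(tok, "lab-template", "cfd-demo-fabric")
print("created blueprint", bp)
```

Whether you drive this from Terraform or a script, the value is the same: the blueprint, not per-switch CLI, becomes the unit of network configuration.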
We next dove deep into AI cluster design and the specific challenges AI training places on the network with James Kelly. James discussed rail-optimized GPU fabric design, which delivers a flatter network topology in groups of eight leafs, called a stripe by Juniper. This localizes traffic, keeping as much of it as possible off the spines, while scaling the network to support over 18K GPUs within a cluster. Can you go to a super-spine configuration to scale further? James argued against it given the additional complexity and latency of such a configuration.
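A back-of-the-envelope calculation shows how the stripe design reaches that scale; the switch port counts below are my own illustrative assumptions, not Juniper specifications:

```python
# Back-of-envelope sizing for a rail-optimized GPU fabric.
# Port counts are assumed for illustration, not Juniper specs.
RAILS = 8                 # GPUs (and NICs) per server; one rail per GPU
LEAF_DOWNLINKS = 64       # assumed server-facing ports per leaf switch
LEAFS_PER_STRIPE = RAILS  # one leaf per rail: Juniper's "stripe"

servers_per_stripe = LEAF_DOWNLINKS           # each server hits every leaf once
gpus_per_stripe = servers_per_stripe * RAILS  # 64 servers * 8 GPUs = 512

stripes = 36  # assumed spine capacity allows this many stripes
total_gpus = stripes * gpus_per_stripe
print(gpus_per_stripe, total_gpus)  # 512, 18432 -> "over 18K GPUs"
```

The key property is that GPU-to-GPU traffic on the same rail never leaves its leaf, which is exactly the traffic localization James described.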
Finally, we looked at network analytics and how Juniper tools help network operations teams administer networks under these new requirements. Kyle Baxter walked us through how telemetry data is ingested into Apstra using Juniper’s custom telemetry collector, aimed at IT admins. The telemetry collector offers a code-free way to oversee the network in flight, and it also integrates an option for command-line interface control. The key takeaway: continuous telemetry feeds across network nodes underpin Apstra’s ability to manage all network resources while giving administrators the tools to sort data into meaningful information about the network.
The TechArena takeaway?
Apstra is a fantastic jump forward in simplifying network management, and we’re keen to see how Juniper continues down the path toward true multi-vendor support across network infrastructure providers. We’re curious whether Juniper sees Apstra only as a companion product to Juniper switch engagements or as a true standalone management console regardless of switch deployment. We’re also excited to see more progress from Juniper and the entire Ethernet ecosystem on advancing Ethernet for AI clusters, including movement from the Ultra Ethernet Consortium and its new collaboration with the Open Compute Project.

TechArena host Allyson Klein chats with Fermyon CEO Matt Butcher about cloud redundancy’s impact on sustainability, and how his organization is delivering new serverless AI capabilities to help usher in a more sustainable computing future.

Andy Bechtolsheim is a legend of the computing industry. Currently at Arista Networks, Andy is arguably best known for founding Sun Microsystems, but he also founded Granite Systems, a Gigabit Ethernet startup acquired by Cisco, and of course Arista. He was also a key founder of the OCP organization. He is known as a savvy Silicon Valley investor as well, one of the earliest investors in Google, funding Larry and Sergey before they’d even incorporated. Depending on whom you want to believe, he also encouraged them to name the company Google. He has received the Smithsonian Leadership Award for Innovation, is an elected member of the National Academy of Engineering, and was recognized by IT pros as the person who has contributed the most to server innovation over the past 20 years. He has a knack for understanding deep technology as well as being on the cusp of what’s next, which is why I prioritized his talk on the AI data center at the OCP Summit.
Andy started his talk by identifying that generative AI performance requirements are growing 10X per year. Where does a 100X performance breakthrough come from over the next few years? Andy called out architectural improvements, next-gen processor advancements, optimized number representations including microscaling formats, and higher memory bandwidth among the key elements for the industry to focus on. Process technology alone will not get us there as the industry moves from 5nm to 3nm and 2nm nodes. 2.5D and 3D packaging technologies will help, as available silicon real estate will basically double. Together, these represent about a 5X performance improvement… not enough.
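The arithmetic behind that “5X… not enough” conclusion is easy to sketch; the individual factors below are my own assumptions, chosen only to be consistent with Andy’s summary:

```python
# Rough arithmetic behind the "5X is not enough" point. The individual
# factors are assumptions chosen to match the talk's stated summary.
process_gain   = 2.5  # assumed gain moving 5nm -> 3nm -> 2nm
packaging_gain = 2.0  # 2.5D/3D stacking roughly doubles silicon area

combined = process_gain * packaging_gain
target = 100.0        # performance breakthrough sought over a few years
remaining = target / combined

print(f"process x packaging ~= {combined:.0f}X")    # ~5X
print(f"still need ~{remaining:.0f}X from architecture, "
      "number formats, and memory bandwidth")       # ~20X
```

In other words, roughly 20X must come from everything other than silicon scaling, which is why the rest of the talk focused on power and cooling.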
Andy suggested that doubling power and improving cooling will be necessary to drive the performance required. He laid out a vision for a 3000W GPU chip, presenting historic challenges for cooling technologies. He introduced a focus on diamond substrate technology (DST), a roughly 5X better thermal conductor than copper, leading to a reduction of hot spots on a die and more effective area for liquid cooling technologies to work. So what cooling is required? Liquid-cooled data centers are the new design focus due to AI. Andy called for a holistic design approach from chip to rack to building.
He introduced an open AI rack GPU/CPU configuration that delivers 100-200 kW per rack, supporting 16 2U GPU blades with supporting switch and power shelves. The practical limit for air cooling, by contrast, is 50 kW using heat-exchanger technology. The fabric inside the rack uses passive copper cables, offering the lowest power and high reliability. Andy believes this rack technology is the future for AI clusters and called for the industry to standardize the specification so that engineering development is not fragmented across redundant solutions.
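To put those rack numbers in perspective, here is a quick back-of-envelope on per-blade power density, derived from the figures above (the per-blade split is my own illustration, not part of the talk):

```python
# Power density implied by the open AI rack numbers above:
# 100-200 kW spread across 16 GPU blades per rack.
rack_power_kw = (100, 200)   # rack envelope cited in the talk
gpu_blades = 16              # 2U GPU blades per rack

for total in rack_power_kw:
    per_blade = total / gpu_blades
    print(f"{total} kW rack -> ~{per_blade:.1f} kW per blade")
# 100 kW -> ~6.2 kW per blade; 200 kW -> ~12.5 kW per blade,
# versus a ~50 kW practical ceiling for an entire air-cooled rack.
```

A single blade drawing what a quarter of an air-cooled rack can dissipate makes clear why liquid cooling dominates the design conversation.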
But what if folks want to go further, to 200 kW? And they do: HPE/Cray has a 200 kW system on the market today as one example of industry delivery at higher power. For this, immersion-cooled racks will rule the day, with all heat removed through the immersion liquid. Here, the industry is still not settled on the liquid to be used: water, mineral oil, or Fluorinert. Water would be preferred, but there are challenges with corrosion, bacterial growth, and the ongoing water chemistry maintenance required. Two-phase cooling using alternative liquids can avoid pumps and the ongoing anti-bacterial or anti-corrosion treatment. What does this mean for IT operators? Andy called for more standards to come into play, as the industry is limited to bespoke solutions that will not achieve the scale required for our impending AI era.
The TechArena’s take from this rapid-fire talk at OCP Summit: the fact that Andy focused on something as esoteric as liquid cooling underscores the pending challenge of performance scaling for AI. The ongoing death of Moore’s Law is driving up energy draw and heat from server silicon, while AI processing requirements are growing at unprecedented rates. Expect rapid innovation in the liquid cooling arena, and look for OCP’s sustainability initiative to push standardization efforts forward in this space.

TechArena host Allyson Klein chats with Meta and OCP lead Dharmesh Jani about his vision for sustainability innovation for the data center and how the OCP sustainability initiative came into being.

TechArena host Allyson Klein chats with Oracle executive Bev Crair on how she’s tackling long-term strategic planning for OCI to get ahead of where the market and customers are headed, and how this approach informs Oracle’s business.