
Runpod head of engineering Brennen Smith joins a Data Insights episode to unpack GPU-dense clouds, hidden storage bottlenecks, and a “universal orchestrator” for long-running AI agents at scale.

In HPC circles, the discussion often revolves around achieving “peak FLOPS.” And while compute is unquestionably important to high performance computing, many HPC applications, including scientific simulation, graph analytics, and finite element modeling, are actually memory-bound. Compute waits for data. That’s where the Xeon 6 processor architecture, and its support for scalable memory, shines.
I recently met with a national lab team whose spectral simulation scaled beautifully to 128 nodes, but performance flattened beyond 128 because memory bandwidth per core collapsed. Their compute was starving, waiting for data that was slow to arrive. While it would be tempting to throw more flops at the problem, the more elegant solution to their compute constraint was delivery of a balance of compute, interconnect, and memory performance.
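One common way to reason about this kind of stall is the roofline model, which caps attainable throughput at the lesser of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below is illustrative only: the peak and bandwidth figures are hypothetical placeholders, not measurements from the lab described above or specifications of any processor.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: the lesser of peak compute and bandwidth * arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def is_memory_bound(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> bool:
    """A kernel is memory-bound when the bandwidth ceiling sits below peak compute."""
    return bandwidth_gbs * intensity < peak_gflops


# Hypothetical node: 4000 GFLOP/s peak, 500 GB/s memory bandwidth (assumed values).
PEAK_GFLOPS = 4000.0
BANDWIDTH_GBS = 500.0

for intensity in (0.25, 2.0, 16.0):  # FLOPs per byte, low to high
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    label = "memory-bound" if is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GBS, intensity) else "compute-bound"
    print(f"intensity={intensity:>5} FLOP/B  attainable={bound:>7.1f} GFLOP/s  ({label})")
```

With these assumed numbers, any kernel below 8 FLOPs per byte is limited by memory rather than by the cores, which is why adding FLOPS alone would not help a workload in that regime.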
Luckily, we’ve designed today’s Xeon CPUs for this balanced performance. In the Xeon 6 processor family, the architecture supports both P-cores (for compute) and E-cores (for throughput) on a unified I/O and memory interface, giving HPC environments the flexibility they crave. That shared fabric ensures that memory bottlenecks don’t isolate execution to narrow lanes, delivering data effectively to meet compute requirements.
How do we accomplish it? It starts with high throughput memory channels and large cache hierarchies, essentially reducing memory contention within the system. We extend this with a NUMA design, carefully tuned, ensuring that parallel tasks see minimal cross-node memory latency. We layer on low-latency coherence paths, essential to multi-socket configurations found in HPC platforms, and multi-workload support mixing compute, data staging, and I/O including checkpointing and data orchestration. This removes the opportunity for non-compute tasks to gate total workload performance.
Do Xeon 6 processors make sense for your HPC configuration? Evaluate your workload requirements, and assess if you’re gated by memory-bound application constraints. I think you’ll discover that the balanced system performance delivered by our latest generation of processors can change the game for compute delivery, making them a solid foundation for HPC deployments.

Self-healing has long been Kubernetes’ north star: restart failed pods, reschedule workloads, reconcile desired state, and keep applications running through everyday failures. But AI is piling on new pressure as teams run GPU-hungry models, mix batch and real-time inference, and stretch Kubernetes across fleets of clusters and clouds. At enterprise AI scale, that pressure lands on site reliability engineering (SRE) and platform teams, who have to reason about GPU scarcity, token volume, spiky inference, and large Kubernetes fleets all at once.
Hundreds of clusters, thousands of services, and a flood of change create too many variables for humans to chase in real time. The question is no longer whether Kubernetes can restart a pod; it’s whether platforms encode SRE judgment into systems that act quickly, safely, and with an audit trail on their behalf.
At KubeCon + CloudNativeCon North America in Atlanta this week, chatter about “agentic SRE” could be heard up and down the massive showroom floor of the Georgia World Congress Center; meanwhile, the Cloud Native Computing Foundation (CNCF) unveiled its new Kubernetes “AI Conformance” push during the opening keynotes. Jonathan Bryce, executive director of cloud and infrastructure at CNCF, and Chris Aniszczyk, CNCF chief technology officer, opened the sessions with a call to action to make AI workloads portable and interoperable across platforms, just as conformance once standardized Kubernetes itself for every major cloud service or private cloud option.
Against this backdrop, I spent a few days talking with exhibitors about agentic SRE and how AI is fundamentally changing Kubernetes operations.
Devtron, Komodor, and Dynatrace are each coming at that problem from a different angle. Devtron is collapsing application, infrastructure, and cost into a single view and layering in an agentic SRE interface. Komodor is turning static runbooks into a policy-scoped, multi-agent SRE that can self-heal fleets and even live-migrate pods off spot instances. Dynatrace is pushing observability from dashboards into decisions while asking whether AI is actually earning its keep.
Taken together, they sketch an ops layer that looks a lot like what AI Conformance is aiming for at the platform layer: standard patterns for how AI runs, heals, optimizes, and proves value on Kubernetes—without treating AI infrastructure stacks as fragile, one-off, bespoke environments that each need custom care and feeding.
Devtron, an enterprise open-source Kubernetes management platform with more than 21,000 installations powering over 9 million deployments, launched Devtron 2.0 during Day 1 of the convention. The release adds an ‘Agentic SRE’ layer on top of its existing footprint to bring AI-powered autonomous operations to production Kubernetes that has to withstand catastrophic failures, ransomware, and high-availability demands at scale.
Devtron 2.0 starts with a very human problem: operators drowning in tools and organizational lines between applications and infrastructure. I chatted with CEO Ranjan Parthasarathy, who said the company wants to “simplify the lives of operators who are running Kubernetes in production.”
“Managing Kubernetes in production is challenging because, first of all, there are too many tools,” Parthasarathy said. “Second of all, there is a very clear line that separates applications and infrastructure management.”
Devtron 2.0 explicitly mimics Kubernetes’ own design.
“We have taken the approach Kubernetes took from day one, which is, they blurred the lines between app and infra in how Kubernetes is architected,” he said. “The APIs for app and infra are all the same. The way you capture app and infra in the form of manifests is all the same. So, why should manageability create an artificial separation?”
Devtron’s answer is a single environment where you can follow a problem from logs to infrastructure to cost without hopping through a half-dozen consoles, with integrated FinOps and GPU visibility so AI workloads are first-class citizens in that view.
According to Devtron, customers like BharatPe and 73 Strings are already using the platform to shrink release cycles from months to weeks and cut mean time to recovery from days to under an hour, which is the backdrop for everything Devtron is now doing with agentic SRE. Their agentic SRE layer walks the classic maturity curve: start with safe reads, then layer in human-approved changes.
“Explain is a feature that we have in our UI at select strategic places where, the minute an error happens, the user can say, ‘Explain.’ And it explains in human readable form what really happened,” Parthasarathy said.
The system also drafts remediation actions that humans review and, once tested in the wild, can bless as auto-apply for recurring conditions. The agent is more like a calculator than a replacement, Parthasarathy said.
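The maturity curve described here (explain first, then human-approved changes, then auto-apply for recurring conditions a human has blessed) can be sketched as a small policy gate. This is a hypothetical illustration, not Devtron’s implementation; the names `Mode`, `RemediationPolicy`, and the condition strings are invented for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    EXPLAIN = "explain"        # read-only: describe what happened
    PROPOSE = "propose"        # draft a fix and wait for human approval
    AUTO_APPLY = "auto_apply"  # apply without approval (blessed conditions only)


@dataclass
class RemediationPolicy:
    """Hypothetical gate: tracks which recurring conditions a human has blessed."""
    blessed: set[str] = field(default_factory=set)

    def bless(self, condition: str) -> None:
        """Record that a human has approved auto-apply for this condition."""
        self.blessed.add(condition)

    def decide(self, condition: str, has_draft_fix: bool) -> Mode:
        if not has_draft_fix:
            return Mode.EXPLAIN      # nothing to change, so only explain
        if condition in self.blessed:
            return Mode.AUTO_APPLY   # recurring condition already approved
        return Mode.PROPOSE          # default: a human reviews the draft


policy = RemediationPolicy()
print(policy.decide("ImagePullBackOff", has_draft_fix=True))  # Mode.PROPOSE
policy.bless("ImagePullBackOff")
print(policy.decide("ImagePullBackOff", has_draft_fix=True))  # Mode.AUTO_APPLY
```

Defaulting to `PROPOSE` keeps the human in the loop unless a condition has been explicitly promoted, which matches the “calculator, not a replacement” framing.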
Komodor is the autonomous AI SRE company for cloud-native infrastructure and operations. At KubeCon, the team highlighted new autonomous self-healing and cost optimization capabilities powered by Klaudia, a purpose-built agentic AI system that sits on top of Komodor’s existing Kubernetes troubleshooting platform.
Klaudia—a multi-agent system—sits closer to the hands-on-the-keyboard side of SRE. The company has run a Kubernetes troubleshooting platform for years; now they’re releasing an additional agentic AI layer on top of it, said Udi Hofesh, who works in product marketing and developer relations for the company.
“This enables the same great value autonomously, basically saving more time and providing more accurate, more expansive insights and recommendations,” he said.
The core idea is to turn Kubernetes’ reconciliation model into practical self-healing at fleet scale. Komodor’s media release about Klaudia leans hard into the scale of that problem. It cites industry data showing that 88% of technology leaders report rising stack complexity and that cloud waste often exceeds 30% of total spend when misconfigurations and idle capacity linger. In one Cisco environment, the company says Klaudia helped cut ticket volume by roughly 40% and accelerated mean time to recovery by more than 80%.
“Kubernetes works and is built around reconciliation,” said Mickael Alliel, backend tech lead at Komodor.
Klaudia’s policies are designed to reconcile the workloads and applications of Komodor’s customers to always be healthy and in a working state, not just fire off static runbooks. That dynamic behavior is the key difference from traditional automation.
“The automatic runbooks or playbooks are, let’s say, something that doesn’t change,” Alliel said. “Klaudia, as the autonomous AI SRE, is able to do it a lot more dynamically… it acts as a real site reliability engineer as opposed to just a series of steps.”
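For readers less familiar with the reconciliation model Alliel refers to, the general pattern is to compare desired state with observed state and emit only the actions needed to close the gap, repeating until nothing is left to do. The sketch below is a generic, simplified illustration using replica counts; it is not Klaudia’s logic or the Kubernetes controller code, and the workload names are made up.

```python
def reconcile(desired: dict[str, int], observed: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return the actions that move observed replica counts toward desired ones."""
    actions: list[tuple[str, str, int]] = []
    for name, want in sorted(desired.items()):
        have = observed.get(name, 0)
        if have < want:
            actions.append(("scale_up", name, want - have))
        elif have > want:
            actions.append(("scale_down", name, have - want))
    # Anything running that is no longer declared gets removed.
    for name in sorted(set(observed) - set(desired)):
        actions.append(("delete", name, observed[name]))
    return actions


def apply_actions(observed: dict[str, int], actions: list[tuple[str, str, int]]) -> dict[str, int]:
    """Simulate applying actions to the observed state (no real cluster involved)."""
    state = dict(observed)
    for verb, name, count in actions:
        if verb == "scale_up":
            state[name] = state.get(name, 0) + count
        elif verb == "scale_down":
            state[name] -= count
        elif verb == "delete":
            del state[name]
    return state


desired = {"api": 3, "worker": 2}
observed = {"api": 1, "worker": 2, "legacy-job": 1}

actions = reconcile(desired, observed)
print(actions)
observed = apply_actions(observed, actions)
print(reconcile(desired, observed))  # an empty list means the state has converged
```

A static runbook would hard-code the steps; a reconciler recomputes them from whatever state it finds, which is the distinction the quote draws.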
With graph-wide context, Klaudia can pull telemetry from multiple namespaces and components and “get a root cause analysis up and running in as little as 15 or 30 seconds,” he said, which matters a lot when you’ve got one SRE for dozens of teams.
Guardrails are a big part of the story, especially for teams burned by LLM hallucinations.
“We actually try to enforce on Klaudia and the AI SRE as many safeguards as possible to ensure that the AI doesn’t hallucinate,” Hofesh said. “We try to ensure that, if it doesn’t know something, it will say, ‘I don’t know and I need more information’… instead of just spitting out something that is not true.”
Every action is logged, and an SRE can see both a full summary of all the actions that Klaudia has taken and the reasoning behind them.
“We gave (Klaudia) a name and a face, but it’s actually hundreds of agents that are interacting with each other,” Hofesh said. For each component in the cloud-native stack, there’s a domain expert agent, orchestrated by workflow agents that mimic SRE motions like detect, investigate, optimize.
Beyond incident response, Klaudia also pushes into cost optimization. Komodor is using it to dynamically right-size workloads, schedule pods to avoid idle resources and bin-packing dead-ends, and use their PodMotion capability to move pods and state across nodes with zero downtime so teams can chase cheaper capacity or handle infrastructure events without disrupting applications.
Dynatrace is an AI-powered observability and security platform that unifies application, infrastructure, log, and business data in a single data lakehouse and uses its Davis AI engine to turn that telemetry into real-time insights and automated remediation. It has had AI in its stack for more than a decade.
“We have been working in the AI space for over 12 years,” said Chief Technology Strategist Alois Reitbauer. “We were always the odd people out—the people doing stuff with AI for a very long time. Not generative AI, but AI in general, and machine learning. We use predictive AI to predict behavior and detect anomalies, then use causal AI to understand the root cause of a problem, to understand cause and effect.”
What’s shifted recently is the focus of AI observability. Early on, he said, it was about tokens and performance. Now that more systems are in production, the question has become whether it provides value.
“It’s not just, ‘How much money are we spending,’ but, ‘Do people actually get something out of it? Should we keep investing into it?’” he said.
Reitbauer pointed out that AI budgets aren’t created from thin air. He explained that, as companies move investment to AI from other areas, they have an expectation of ROI that is at least as high, if not higher, than before. He gave an example of a website that offers a product for $3, and pays $5 to generate the recommendation; not exactly a model of ROI.
On the plumbing side, he described observability’s progression from collecting data, to anomaly detection, to root-cause analysis and now to action.
“We’re moving into the next generation where tools are actually able to take action,” he said. Instead of just saying “your system is down, your servers are overloaded,” a next-gen system might say: “Your system is down, your servers are overloaded. I propose an immediate mitigation action to scale up from three to five servers… and I already created the PR, just click approve here.”
Long term, it can also surface proposals for the developer on how that code could potentially be rewritten to be more efficient.
Dynatrace’s internal agentic platform is wiring those pieces into workflows, Reitbauer said.
“Think of it as a low-code way of building an agent, almost,” he said.
The use cases line up with the themes from KubeCon: remediation workflows based on observability data, preventive workflows that reconfigure environments before trouble hits, and continuous optimization tuned to cloud environments.
Kubernetes AI Conformance is about making AI workloads on Kubernetes interoperable and portable across a messy mix of models, frameworks, and hardware.
The companies I talked with are doing the same thing for operations: turning AI-heavy Kubernetes environments from bespoke into systems that can be monitored, healed, optimized, and justified at scale.
The interesting advances aren’t the fully autonomous slogans; they’re the boring-but-essential scaffolding behind them. These platforms are also shipping against real-world pain. Devtron points to customers like BharatPe and 73 Strings using its unified control plane to shrink release cycles, improve stability, and drive MTTR down from days to under an hour. Komodor cites Cisco’s platform engineering team cutting ticket loads by around 40 percent and improving MTTR by more than 80 percent as Klaudia moves from reactive triage to proactive self-healing and optimization.
Devtron’s merged application/infrastructure/cost view and human-in-the-loop agent treat autonomy like a calculator, not a replacement. Komodor’s domain-specific multi-agent approach attacks both the incident math and the spot-capacity economics. Dynatrace is pushing observability from a passive system of record toward an active participant that can propose or trigger changes—and then tie those moves back to business outcomes.
If Day 1 in Atlanta was about putting a floor under AI workloads with Kubernetes AI Conformance, these conversations were about building the mezzanine above it: how AI actually runs and proves itself in production. The self-healing promise of Kubernetes isn’t going away; it’s being reimplemented at the organizational layer—across clusters, costs, and teams—so platform and SRE leaders can keep up with AI-era workload diversity and autonomy without scaling humans linearly with every new deployment.

MLCommons’ MLPerf Training v5.1 lands with three clear signals: generative AI continues to shape the benchmark mix, scaling discipline is where many teams are winning time, and the roster of credible submitters is getting broader. This round includes 65 unique systems from 20 organizations spanning silicon, systems, clouds—and, notably, an academic HPC center.
“The view I see in MLPerf is like a Formula One race—same track and rules, room for tuning, and you see who can finish fastest,” said Chuan Li, Chief Scientific Officer at Lambda, who led the company's MLPerf v5.1 efforts. It’s a clever encapsulation of MLPerf’s value proposition: standardized tasks and target quality keep the contest honest while leaving space for technique.
What’s new in v5.1 is squarely aimed at today’s workloads. Llama 3.1 8B replaces BERT for LLM pretraining—a modern, decoder-only architecture that fits on a single node (≤8 accelerators) yet mirrors software patterns used at larger scales. On the image side, Flux.1 replaces Stable Diffusion v2, reflecting the shift to transformer-based diffusion models with cleaner validation via loss. Together, these swaps align the suite with the stacks enterprises are actually deploying.
Momentum and scale show up in the submission patterns. Multi-node entries climbed sharply versus a year ago, and genAI tests drew heavy participation: Llama 3.1 8B debuted with strong interest, while Llama 2-70B LoRA continued to be a favorite fine-tuning proxy. Performance trends outpaced a simple Moore’s Law line again; gains came not just from fresh silicon but from numerics, software, and fabrics—exactly the kind of full-system work buyers need to see.
NVIDIA posted most of the fastest times and largest-scale runs this round—especially on GB200/GB300 NVL72 configurations—reflecting stack maturity and intra-rack NVLink scale. Still, the broader story is ecosystem momentum: new entrants, academic participation, and software/networking gains that turned more multi-node runs into reproducible, closed-division results.
• University of Florida (academic first-timer): Ran across seven benchmarks on HiPerGator DGX B200, including multi-node scaling to 448 GPUs, demonstrating closed-division reproducibility on a shared HPC environment.
• Wiwynn (platform new entrant): Kinabalu posted Llama 2-70B LoRA results at 72 and 576 GPUs on GB200 NVL72, signaling readiness of its NVLink-centric design for fine-tuning workloads.
• Datacrunch (cloud first-timer): Brought up an 8× B200 Llama 3.1-8B run via Slurm/Pyxis “Instant Clusters,” positioning for fast, reproducible re-runs rather than one-off hero numbers.
Precision and practicality deserve a note. Several submitters leaned into lower-precision training (FP8 → FP4 variants) where numerically stable, but MLPerf’s rubric keeps that grounded: time-to-target-quality forces any optimization to actually converge. The other big lever is networking and topology—RDMA over InfiniBand or tuned Ethernet, clean hierarchies, and reliability at scale—because eight nodes only help if they act like eight, not three.
Lambda was one of a short list to post on GB300 NVL72 (72 Blackwell Ultra GPUs in a single NVLink domain). Two takeaways surfaced in my side interview with Chuan Li. First, the speed-ups split roughly half-and-half between hardware (more memory, higher inter-GPU bandwidth) and software (driver/library/framework maturation).
Second, numerics helped at the margin: moving from FP8 to an FP4 variant delivered an additional double-digit percentage improvement while still meeting the accuracy target. There’s also a practical lesson here: clean, converged runs at the edge of scale require weeks of lined-up capacity and tight coordination across DC ops, fabric, and software. Useful proof point—one of 20, not the whole story.
“We saw a 1.66x speedup in our Llama 2-70B run compared to previous submissions,” Li said. “This performance improvement showcases the power of the latest NVIDIA hardware, combined with Lambda’s cloud orchestration capabilities.”
If you’re using v5.1 to guide purchasing or platform bets, a few simple rules help:
• Compare like for like. Start within the same benchmark and similar accelerator counts. A 32-GPU result and a 512-GPU result are not interchangeable.
• Look for scale curves, not just a single number. Do you see near-linear improvements from 8 → 16 → 32 → 64 GPUs? That often tells you more than a hero time.
• Check the software notes. Frameworks, kernels, parallelism strategy, IO/storage, and data pipelines are where much of the delta lives—and MLPerf links to them.
• Use Llama 3.1 8B as a quick stack sanity test. It’s modern, single-node accessible, and a good proxy before you commit to larger spend.
• If you care about image generation, Flux.1 is the new reality. Expect different stress points than SD v2 (attention/memory/diffusion schedule) and plan tuning accordingly.
• Treat FP4 wins as conditional on convergence. The fastest path that misses target quality doesn’t count in MLPerf—and it shouldn’t in production, either.
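To make the “scale curves” rule concrete, one simple check is scaling efficiency: the measured speedup relative to the smallest run, divided by the ideal speedup from the added accelerators. The sketch below uses hypothetical time-to-train values, not MLPerf v5.1 results.

```python
def scaling_efficiency(runs: dict[int, float]) -> dict[int, float]:
    """Map GPU count -> efficiency relative to the smallest run.

    `runs` maps GPU count to time-to-train in minutes. An efficiency of 1.0
    means perfectly linear scaling; lower values mean added GPUs are
    contributing less than their share.
    """
    base_gpus = min(runs)
    base_time = runs[base_gpus]
    return {
        gpus: (base_time / minutes) / (gpus / base_gpus)
        for gpus, minutes in sorted(runs.items())
    }


# Hypothetical time-to-train figures (minutes) for one benchmark.
runs = {8: 80.0, 16: 40.0, 32: 25.0}

for gpus, eff in scaling_efficiency(runs).items():
    print(f"{gpus:>3} GPUs: efficiency {eff:.2f}, behaves like {eff * gpus:.1f} GPUs")
```

In this made-up example the 32-GPU run behaves like roughly 25.6 GPUs, the kind of gap that a single headline time would hide.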
This round isn’t just about newer GPUs; it’s about maturing engineering. The two new tests (Llama 3.1 8B and Flux.1) meet the moment for enterprise Gen AI, and the influx of first-time submitters expands the set of credible places to run—from platform OEMs to an academic HPC center to nimble clouds.
As organizations continue to push the boundaries of AI infrastructure, the industry is seeing an acceleration of hardware, software, and networking innovations that are making frontier AI models more accessible and deployable at scale.
As AI infrastructure evolves, MLPerf Training provides a vital benchmark for the industry, ensuring that progress in AI development is transparent, reproducible, and measurable.

Cloud-native isn’t contracting—it’s climbing up the stack. The Cloud Native Computing Foundation’s (CNCF’s) latest State of Cloud Native Development—done in partnership with SlashData—shows the community expanding beyond traditional Kubernetes operators into a much wider slice of backend developers who may never touch cluster primitives directly. That shift explains why some dashboards show container/Kubernetes “usage” leveling off even as cloud-native grows overall: the interface is moving up a layer to internal developer platforms and opinionated tooling.
“Cloud-native is moving from being a tech stack to a cultural shift in how developers interact with infrastructure,” said Bob Killen, senior technical program manager at CNCF. “It’s about empowering teams to build on top of a flexible, standardized foundation, not just running workloads in containers.”
CNCF and SlashData estimate 15.6 million developers now qualify as cloud native, about 32% of the global developer population, with roughly 9.3 million in the traditional backend segment. Among developers who work on backend services, 56% are cloud native in Q3 2025—up from 49% in Q1 2025. Hybrid-cloud deployments climbed from 22% in early 2021 to 30% in Q3 2025, and multi-cloud sits at 23%. Meanwhile, only 41% of professional machine learning/artificial intelligence (ML/AI) developers identify as cloud native—likely because many consume AI via managed endpoints that abstract away the stack.
Killen described the pattern plainly in our interview: many backend developers now deploy through internal platforms like Backstage and other dev-portal tools rather than touching containers or Kubernetes directly. That doesn’t reduce the relevance of Kubernetes; it elevates it and makes it even more accessible. Teams “build once” to Kubernetes and point workloads to wherever capacity and cost line up, on-prem or cloud, without re-plumbing their developer workflow. This is the portability dividend the ecosystem bet on a decade ago.
“While AI/ML developers have infrastructure-heavy workloads, many don’t identify as cloud-native developers because they’re interacting with the infrastructure through abstracted layers like managed endpoints,” he said.
Hybrid-cloud’s steady rise isn’t a fashion cycle; it’s economics and capacity. GPU availability, compliance posture, and data-gravity considerations favor a mixed estate: local clusters for steady-state workloads, burst capacity in public clouds when queues spike, and selective use of specialized GPU instances for inference. The report’s trendline from 22% hybrid in 2021 to 30% in 2025 tracks what we hear from platform teams: design for flexibility first, then optimize per workload.
The CNCF and SlashData Tech Radar Report, which surveys what tools developers are actually using and recommending, points to a few emerging patterns:
Here are a few observations from the CNCF/SlashData State of Cloud Native Development report:
1. Design Attention Is Moving to the Portal Layer: With 77% of backend developers using at least one cloud-native technology while many don’t identify as “Kubernetes users,” the center of gravity appears to be shifting toward internal developer platforms. Cost, performance, and security signals are increasingly surfaced in portals rather than in cluster-level tools.
2. Hybrid/Multi Is Becoming a Steady State: The report shows hybrid usage at 32% and multi-cloud at 26% among backend developers, with distributed cloud at 15%. Taken together, those shares suggest multi-venue deployment is becoming routine rather than exceptional, with Kubernetes serving as the portability layer across environments.
3. AI Plumbing Is Consolidating Around a Few Stacks: Many AI teams still consume managed endpoints, but the Technology Radar highlights a narrowing set of building blocks: Triton/DeepSpeed/TF Serving/BentoML for inference, MCP/Llama Stack for agentic scaffolding, and Airflow/Metaflow for orchestration. The pattern suggests a pragmatic core is emerging inside otherwise varied AI pipelines.
Only 41% of professional AI/ML developers are counted as cloud native in the study. That doesn’t mean they aren’t running on cloud-native infrastructure; it means consumption is often through higher-level SaaS or managed services where the platform owns the runtime. As more teams bring inference and retrieval closer to their data for cost, latency, or privacy, expect that percentage to rise—especially as internal developer platforms (IDPs) make “cloud-native-by-default” the path of least resistance.
Two mechanics will shape 2026 roadmaps.
Cloud-native isn’t fading; it’s moving up the stack. The center of gravity appears to be shifting from cluster primitives to internal developer platforms. Kubernetes continues to function as the portability layer, while more developers interact through portals and opinionated tools rather than directly with containers.
Hybrid and multi-cloud usage looks less like an edge case and more like standard operating context. The data suggests routine use of multiple execution venues as organizations balance capacity, cost, and locality considerations over time.
Developer sentiment around inference engines (e.g., Triton, DeepSpeed, TensorFlow Serving, BentoML), agentic scaffolding (MCP, Llama Stack), and orchestration (Airflow, Metaflow) points to a pragmatic core of components coalescing inside otherwise diverse AI pipelines.
Across interviews and releases, “agentic SRE” is taking shape as a layered pattern: explain-and-observe capabilities first, human-reviewed changes next, and policy-scoped autonomy for recurring fixes. Notable strides include transparent reasoning, auditable actions, and domain-scoped agents aimed at reducing error surface.
Two advancements stand out: platform-level immutability for backups that treats ransomware recovery as table stakes, and live container migration aimed at maintaining long jobs on ephemeral capacity. Both represent meaningful steps toward reliability at fleet scale without sacrificing economics.

Allyson Klein talks with author and Google/Intel alum Wanjiku Kamau on moving past AI skepticism, learning fast, and using new tools with intention—so readers start where they are and explore AI with hope.

The energy was palpable across the Georgia World Congress Center in Atlanta this morning as 9,000 people gathered for the 10th annual KubeCon + CloudNativeCon convention, where the Linux Foundation announced a brand new Kubernetes AI Conformance program, a community-driven certification aimed at making AI workloads portable and interoperable across Kubernetes platforms.
The opening keynotes drew a packed house and delivered a clear message: the next decade of cloud native will be defined by how well this community standardizes AI at scale.
It’s a fitting inflection point. This year marks the 10-year anniversary of the Cloud Native Computing Foundation (CNCF), and the program’s journey from a handful of seed projects to a global, high-velocity ecosystem is the backdrop for what comes next.
The CNCF launched in 2015 under the Linux Foundation to steward a new operational model built around containers, orchestration, and declarative automation. The first CNCF Board meeting took place that December at The New York Times offices, and by March 2016, the Technical Oversight Committee had formally accepted Kubernetes as the foundation’s first project. Ten years later, the numbers tell the story: nearly 300,000 contributors across 190 countries have pushed 18.8 million contributions into more than 230 projects. The once-compact cloud native landscape now spans everything from core orchestration to observability, service meshes, security, data, and developer experience.
That community scale shows up in the audience, too: roughly 48 percent of attendees are first-timers, a reminder that cloud native keeps onboarding new builders even as it professionalizes.
The membership base has swollen from 22 founding organizations to more than 700 member companies—platinum and gold vendors, a deep bench of silver members, and a growing cadre of end-user organizations that help steer real-world priorities. A new platinum end user, CVS Health, was announced on stage, underscoring how cloud native has moved well beyond hyperscale tech firms into heavily regulated, mission-critical industries.
“The two most significant trends are merging right now—cloud native and AI are not separate technology trends; they are really coming together,” said Jonathan Bryce, executive director of cloud + infrastructure at the Linux Foundation.
That was the through-line from the main stage this morning. CNCF leaders framed AI in three layers—training, inference, and applications/agents—and called out inference as the near-term hotspot. The scale is staggering: Google said its systems jumped from about 980 trillion tokens per month to roughly 1.33 quadrillion tokens per month in just a few months, and every large enterprise is now under pressure to stand up reliable, cost-efficient AI services—not just proof-of-concepts.
To meet that moment, CNCF introduced the Kubernetes AI Conformance program, a community-driven certification aimed at making AI workloads portable and interoperable across Kubernetes platforms. Platforms that earn AI Conformance are expected to meet concrete requirements across six pillars:
Accelerators: hardware abstraction and scheduling for GPUs/TPUs and other accelerators (built on capabilities like Dynamic Resource Allocation, which graduated to GA in Kubernetes 1.34).
The original Kubernetes Conformance program is one of the quiet reasons cloud native scaled: it gave buyers confidence that distributions wouldn’t drift and that workloads would behave predictably across environments. AI needs the same discipline. Without it, teams get trapped in bespoke integrations, vendor-specific quirks, and fragile pipelines that are hard to operate at scale.
A live demo on stage walked through what an AI-conformant cluster looks like in practice: using DRA to discover accelerators and define resource plans; deploying a vision-language model; scraping model metrics; autoscaling via custom metrics; and exposing accelerator telemetry such as utilization and temperature. The point wasn’t the specific model—it was the proof that a consistent, open set of platform guarantees shortens the path from “it runs” to “it operates.”
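The autoscaling step in a flow like this typically follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas equal the ceiling of current replicas times the ratio of the observed metric to its target, skipped when the ratio is within a tolerance and clamped to configured bounds. The sketch below reproduces that arithmetic in isolation; the metric values and replica bounds are hypothetical, and the demo’s actual configuration is not described in the source.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """HPA-style calculation: ceil(current * observed / target), clamped to bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, so avoid churn
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical custom metric: in-flight inference requests per pod, target of 30.
print(desired_replicas(3, current_metric=50, target_metric=30))  # scale out
print(desired_replicas(4, current_metric=31, target_metric=30))  # within tolerance
print(desired_replicas(4, current_metric=10, target_metric=30))  # scale in
```

The tolerance band is what keeps a custom metric that hovers near its target from triggering constant rescaling.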
Initial participants shown on the keynote logo wall include hyperscalers, enterprise platforms, and AI infrastructure providers such as Google Cloud, Microsoft Azure, AWS, NVIDIA, Red Hat, Oracle, SUSE, SAP, Akamai, Alibaba Cloud, Broadcom, CoreWeave, DaoCloud, and Kubermatic, among others. Expect that roster to grow quickly as vendors align their roadmaps and customers start asking for the badge.
By defining a minimum common denominator for accelerators, security, scheduling, observability, and operators, AI Conformance gives builders a stable target and gives organizations a portable operating model. Vendors can innovate above the line; users get fewer surprises when they move from lab to production or from one environment to another. It’s exactly the kind of boring, essential plumbing that lets the more exciting parts of AI—faster models, better retrieval, smarter agents—ship without reinventing the platform every time.
CNCF’s latest developer data puts the cloud-native population at 15.6 million, with nearly half already building AI systems. That overlap explains the energy in Atlanta: the community that figured out how to run the internet reliably now wants to make AI equally routine. The early signal is that Kubernetes will be the common substrate for AI not only because it’s ubiquitous, but because conformance programs like this one make it predictable.
Standards are how ecosystems scale. Kubernetes AI Conformance is CNCF replaying a proven playbook at precisely the right layer of the stack. It won’t pick winners for model servers, vector databases, or agent frameworks—and it shouldn’t. Instead, it sets a floor for what every platform must guarantee so AI teams can move faster without stapling together one-off integrations for each environment.
Three implications to watch:
Keep following TechArena.ai this week for updates and news from KubeCon + CloudNativeCon in Atlanta.

I recently visited a customer whose AI racks were reaching 750–800W per slot. Their data center layout couldn’t push more airflow, so they were in hot pursuit, no pun intended, of cold plate cooling alternatives. But as they forecasted system power forward, they saw a near-term horizon where cold plate technology may not provide enough thermal mitigation to address their dense infrastructure demands. They faced a choice: migrate to cold plate now, knowing that another technology migration may be required in the future, or take the plunge into immersion cooling right away.
This customer is not alone. We have reached the choke point of air cooling within highly dense data center infrastructure, and more deployments are reaching for liquid cooling solutions. Today, that liquid cooling alternative is likely a cold plate solution, delivering the right mix of cooling efficiency and required thermal control. And while this transition is playing out in data centers today, many are asking how long cold plate solutions will keep pace with data center requirements. After all, today’s racks are climbing past 1 megawatt, with data center facilities scaling past 1 gigawatt, representing unprecedented heat to mitigate. This leads to the question of how long cold plate’s day in the sun will last before immersion cooling becomes a required alternative.
But what is cold plate? Cold plate solutions deliver liquid directly to the chip in a controlled loop and handle thermal densities significantly better than air cooling alternatives. Many HPC and AI boxes today already support cold plate solutions. The technology is relatively mature, perceived to be controllable, and less disruptive to retrofit into brownfield data center environments. Up to a point, it works well!
At some point, though, cold plate solutions reach an existential challenge of heat dissipation. Customers can experience leaks or thermal escapes with highly variable AI performance as system density scales. For these customers, immersion cooling (full immersion in dielectric fluid) offers an alternative. Immersion handles higher power densities with lower energy overhead, but it requires much more system certification to deploy safely.
At Intel, we see the coming of the immersion era, at least for high performance compute clusters. That’s why we’re helping to future proof infrastructure investment by certifying Xeon platforms for immersion, ensuring that CPUs behave reliably in immersion environments. This enables higher rack densities with confidence in stability and availability.
So how should you approach the liquid alternatives? This has a lot to do with the density targets you’ve got on your infrastructure roadmap, and at what point you’ll hit a requirement for immersion. The time is now to evaluate cold plate solutions for immediate requirements and begin talking to vendors about immersion support. If you’re considering greenfield buildout, a transition to immersion sooner for your densest racks may make sense. In brownfield environments, take advantage of cold plate alternatives and their easier integration for the time being. Most importantly, strategically plan infrastructure within a long-term horizon to prioritize an efficient path through liquid cooling adoption with the right compute infrastructure support at each point in the migration path.

The automotive industry’s introduction of the Controller Area Network (CAN) protocol in 1986 marked a significant departure from point-to-point wiring for electrical connections, which until then had been the mainstay of the industry. The shift to a relatively lightweight bus-based architecture was a nod to reality: electrical content in the vehicle was scaling fast, and alternatives to one-off wiring were needed.
The first vehicle to use the CAN bus was a Mercedes-Benz S-Class in 1991. CAN connected five Electronic Control Units (ECUs) for engine, body, and climate control. That moment marked the starting point for the evolution—if not revolution—of connectivity standards in the automobile, and it set the stage for architectural disruption.
Today’s Software-Defined Vehicle (SDV) is embracing a zonal architecture, a connectivity scheme based primarily on physical location rather than the specific capability of any given actuator or sensor. This approach typically uses about 300 meters of wiring, a reduction of roughly 4,700 meters compared with earlier distributed designs—a substantial savings in both weight and cost. The zonal model leans on a myriad of connectivity technologies that deliver more robustness, reliability, and deterministic timing than prior schemes. Those improvements are essential as electronic subsystems take over greater—sometimes complete—control of the vehicle.
A useful analogy: connectivity in a car is the nervous system. It must be responsive and provide failover. The nervous system not only links the senses but also controls muscles in response to the brain. Vehicles aren’t any different.
Since CAN, many other connectivity types have been introduced to address different functions in the vehicle. Some, like Ethernet, were adapted from mainstream computing and retrofitted to automotive. Those adaptations generally address real-time, deterministic responsiveness and fault detection. Interestingly, with the possible exception of CAN, standards defined specifically for automotive haven’t found broad adoption elsewhere.
A brief aside on EMI: a wire is, electrically, an antenna. That simple model explains a lot. Both radiated energy limits and required immunity are governed by industry standards. If not managed properly, high-speed signals across long wires can create (and suffer from) EMI. Unfortunately, the need for high-performance communications is at odds with minimizing emissions. In automotive, nothing about this is easy—we just tend to take it for granted.
What follows is a quick survey of the alphabet soup now in play.
These standards are either still in use or have been replaced by newer alternatives:
Two primary technologies have dominated here:
A2B (Automotive Audio Bus), introduced by Analog Devices in 2014, uses low-cost unshielded twisted pair to carry audio from a head unit (master) to slave devices like speakers and microphones. It supports multiple channels of high-resolution digital audio and microphone arrays for hands-free calling and adaptive noise cancellation.
For lidar, cameras, high-resolution surround view, and driver information displays, SerDes links embed clocking within the data stream to achieve high rates with low latency:
There are multiple (about seven) automotive Ethernet derivatives tuned for a wide spectrum of in-vehicle needs, from 10 Mbit/s to 25 Gbit/s, with different reaches and price points. They address everything from “CAN-plus” body functions to ECU-to-ECU backbones. All use differential signaling; most ride over unshielded single twisted pair to minimize cost and weight. Alongside PHYs, the associated switching fabric has also been adapted for automotive.
Standard Ethernet is best-effort. TSN Ethernet adds determinism: end-to-end camera-to-actuator latency under 5 milliseconds with less than 50 microseconds of jitter is achievable. That performance and the ability to prioritize time-critical traffic make Ethernet viable for emergency braking. TSN is a family of specifications; several variants address time sync, scheduling, stream reservation, and reliability.
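To make those two figures concrete, here is a minimal budget check, assuming a simple additive model in which end-to-end latency is the sum of per-stage delays and jitter is the gap between the best and worst case. The stage names and every per-stage number are hypothetical, chosen only for illustration. Only the 5 millisecond and 50 microsecond limits come from the text above.

```python
from dataclasses import dataclass

LATENCY_BUDGET_US = 5_000  # "under 5 milliseconds" end to end
JITTER_BUDGET_US = 50      # "less than 50 microseconds" of jitter


@dataclass(frozen=True)
class Stage:
    name: str
    best_us: int   # best-case delay through this stage, in microseconds
    worst_us: int  # worst-case delay through this stage, in microseconds


def check_path(stages):
    """Return (worst-case latency, jitter, whether both budgets are met)."""
    best = sum(s.best_us for s in stages)
    worst = sum(s.worst_us for s in stages)
    jitter = worst - best
    ok = worst < LATENCY_BUDGET_US and jitter < JITTER_BUDGET_US
    return worst, jitter, ok


# Hypothetical camera-to-actuator path; none of these values are measured.
PATH = [
    Stage("camera capture and serialization", 1500, 1510),
    Stage("zonal switch hop", 20, 30),
    Stage("backbone switch hop", 20, 30),
    Stage("central compute", 2500, 2510),
    Stage("actuator ECU", 400, 405),
]

worst_us, jitter_us, within_budget = check_path(PATH)
print(worst_us, jitter_us, within_budget)
```

The additive model is deliberately crude. Its point is that a deterministic network lets each stage carry a bounded worst case, so the whole path can be checked on paper before anything is built.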
Point-to-point wiring is mostly a thing of the past.
Weight and cost pressures drove alternatives; zonal architectures dramatically trim both.
CAN signaled the first big inflection; many link types now coexist, each optimized for its job.

As AI adoption accelerates across industries, financial services sits on the front line of both innovation and risk. From fraud detection to customer personalization, AI is reshaping how institutions operate. But the sector’s high stakes and regulatory complexity demand a uniquely careful approach.
At the recent AI Infra Summit in Santa Clara, Jeniece Wnorowski and I sat down with FinTech expert Anusha Nerella for a Data Insights conversation about how financial organizations can responsibly scale AI, stay ahead of fraudsters, and build teams equipped for the future.
“Many institutions are still in the early stages of AI deployment, while bad actors are moving fast and experimenting aggressively,” Nerella said.
This dynamic creates an urgent need for stronger, more agile defenses. Nerella emphasized that financial firms must accelerate their AI implementation cycles without sacrificing the governance and compliance guardrails that define the industry.
Asked what the broader technology ecosystem should do to support responsible AI in finance and enterprise, Nerella returned to the importance of regulatory alignment.
“Everything has to go through the regulatory and compliance [process] in order to make it responsibly…applicable to the enterprise sector,” she said.
But regulations alone aren’t enough. Nerella believes that financial institutions must rethink team structures and knowledge transfer to keep pace. She advocates for what she calls “reverse training,” in which organizations bring in engineers well-versed in AI frameworks and libraries, then combine their expertise with the strategic experience of senior leaders.
By fostering two-way collaboration between new AI talent and experienced financial professionals, companies can build stronger, future-ready teams.
“It becomes… a collaborative effort for sure,” Nerella explained. “It’s an equal opportunity here because whoever [has] decades of experience…might have limited exposure towards AI-based frameworks or library utilization or hands-on experience.”
This equal exchange of knowledge, she argued, is essential for success.
For organizations just beginning their AI journeys, Nerella’s advice is both practical and pointed: don’t try to boil the ocean. She recommends starting with “two or three clear use cases with ROI” and ensuring that governance and control mechanisms are in place from the outset.
“When you follow all these basic principles, then you will be able to see…result-oriented AI-based implementation from your end,” she said.
Throughout the conversation, she underscored that AI success in financial services requires human-in-the-loop collaboration.
The financial sector’s high regulatory stakes, complex legacy systems, and relentless fraud threats make its AI journey distinct. Nerella’s insights highlight that the path forward isn’t just about technology—it’s about culture, compliance, and collaboration.
To build responsible and trusted AI systems, financial organizations must:
As the industry races to stay ahead of increasingly sophisticated fraud tactics, success will depend on balancing agility and accountability.

VAST Data announced a $1.17 billion commercial agreement with CoreWeave that makes the VAST AI OS the primary data foundation for CoreWeave’s AI cloud, extending a collaboration that began when CoreWeave selected VAST to power its GPU cloud storage layer in 2023.
AI clouds are maturing from GPU-first builds to balanced, data-aware platforms that can keep training pipelines fed while serving real-time inference at scale. In that context, the data layer isn’t a bolt-on—it’s table stakes. VAST and CoreWeave are formalizing that reality in dollar terms and roadmap alignment.
The companies describe a multi-year commercial agreement that cements VAST as CoreWeave’s primary data platform. The release emphasizes instant access to massive datasets, reliability at cloud scale, and performance across both training and inference. It also highlights a “new class of intelligent data architecture” aimed at continuous training and real-time processing.
While detailed term length wasn’t disclosed, outside reporting characterizes the pact as multi-year and situates it within the broader generative-AI infrastructure build-out, noting VAST’s momentum and revenue trajectory this year.
CoreWeave is known for GPU-accelerated infrastructure tailored for AI/ML, rendering/VFX, and other compute-intensive workloads. VAST’s AI OS consolidates data and compute services, with the company positioning its DASE architecture as a parallel distributed system designed to remove trade-offs between performance, scale, and resilience. In practical terms, the pitch is a single, scalable substrate that can be deployed across any CoreWeave data center to support both throughput-heavy training and latency-sensitive inference paths.
Two strategic threads stand out:
This isn’t a green-field pairing. CoreWeave first named VAST as the data platform for its NVIDIA-powered AI cloud back in 2023, and since then both companies have scaled rapidly alongside enterprise AI adoption. Today’s announcement formalizes that relationship with a sizable commercial framework and sets expectations around platform primacy.
AI infrastructure buyers are hunting for time-to-value: they want capacity that stands up quickly, sustains training throughput, and serves inference without spiraling costs. That’s pushing clouds—especially specialized “neo-clouds” like CoreWeave—to harden their data planes with predictable performance and global operability. The $1.17B figure signals that in the AI era, the data layer is where performance, reliability, and unit economics converge. External coverage also notes VAST’s broader customer footprint and fundraising signals, reinforcing the company’s position as more than a storage vendor—it’s pitching a full AI operating substrate.
VAST says the partnership will continue to evolve with shared product development. An analyst community briefing with CEO Renen Hallak on Thursday, November 13, will unpack strategy implications and additional updates.
The signal here isn’t just the dollar figure—it’s the architectural vote: CoreWeave is betting that a unified, software-defined data plane is indispensable to AI cloud differentiation. For VAST, “AI OS” stops being slideware and becomes contractually central to one of the most prominent AI clouds. The near-term win is customer experience—simpler pipelines, faster iteration, fewer knobs to turn. The longer-term implication is competitive pressure: hyperscalers and other neo-clouds will need similarly opinionated data stacks that erase the gaps between training and inference. If VAST and CoreWeave can translate this alignment into measurable SLA gains and lower delivered cost per token/frame/query, this deal will read as a blueprint for how AI clouds professionalize the data layer at scale.

Billions of customer interactions during peak seasons expose critical network bottlenecks, which is why critical infrastructure decisions must happen before you write a single line of code.

Recorded at #OCPSummit25, Allyson Klein and Jeniece Wnorowski sit down with Giga Computing’s Chen Lee to unpack GIGAPOD and GPM, DLC/immersion cooling, regional assembly, and the pivot to inference.

As a national security professional developing next-generation tools and tradecraft on the front lines of the cybersecurity war, I’ve been wondering: Is this conflict winnable?
I polled a couple dozen friends and colleagues (CISOs, federal law enforcement officers, hackers, interns, and others), and the consensus was sobering: the cybersecurity war is a stalemate. Tech cuts both ways. Attackers and defenders keep leveling up, and there’s no silver-bullet tool that ends the fight.
That led me to a question: Is this cyber conflict fundamentally analogous to the War on Drugs? Both look like persistent, systemic battles that can never be fully won, unlike, say, train robbery in the American West—a criminal trend that burned out by the early 1900s.
It seems to me that these two wars are not merely technical or economic problems; they are enduring conflicts.
That nature determines how we should engage. If the cyber war could be ended by a single technical breakthrough, much as train robbery faded with the disappearance of physical cash on trains, we should put all effort into inventing and adopting that tool. If, on the other hand, it is an enduring arms race, we should shift focus from preventing every breach to building a resilient digital immune system.
The strategy shifts from prevention to resilience, reflected in today’s emphasis on ZTA (Zero Trust Architecture), SOAR (Security Orchestration, Automation, and Response), and XDR (Extended Detection and Response). Success for a CISO is measured less by the absence of breaches and more by speed and recovery: mean time to detect, time to contain, and time to recover.
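Those three measures are easy to compute once incidents are timestamped. The sketch below assumes one common convention: detection is measured from the start of the incident, containment from detection, and recovery from containment. Organizations define these intervals differently, and the record layout and sample incidents here are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime
    detected: datetime
    contained: datetime
    recovered: datetime


def _mean_hours(deltas):
    """Average a list of timedeltas and express the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600


def response_metrics(incidents):
    """Mean time to detect, contain, and recover, in hours (assumed definitions)."""
    return {
        "detect_h": _mean_hours([i.detected - i.started for i in incidents]),
        "contain_h": _mean_hours([i.contained - i.detected for i in incidents]),
        "recover_h": _mean_hours([i.recovered - i.contained for i in incidents]),
    }


# Two made-up incidents, for illustration only.
INCIDENTS = [
    Incident(datetime(2025, 1, 1, 0), datetime(2025, 1, 1, 4),
             datetime(2025, 1, 1, 6), datetime(2025, 1, 1, 16)),
    Incident(datetime(2025, 2, 1, 0), datetime(2025, 2, 1, 2),
             datetime(2025, 2, 1, 3), datetime(2025, 2, 1, 9)),
]

print(response_metrics(INCIDENTS))
```

Tracked quarter over quarter, falling values are the kind of evidence of resilience that a count of prevented breaches cannot provide.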
Both the cybersecurity war and the War on Drugs are enduring struggles powered by strong economic incentives, global in scope, and defined by asymmetric contests against adaptive, networked adversaries.
President Richard Nixon declared the War on Drugs in 1971; the Drug Enforcement Administration followed in 1973. For decades, the focus was enforcement. In recent years, many states have legalized or decriminalized marijuana and the federal stance has shifted toward more public-health-oriented approaches. After immense effort and sacrifice, the practical outcome resembles a stalemate rather than a decisive victory.
By contrast, train robbery in the late-19th-century American West was a localized, tactical crime against fixed infrastructure. Consider the Wild Bunch’s 1900 attack on a Union Pacific train near Tipton, Wyoming: the target was a safe with gold and banknotes. As banks shifted to electronic transfers and reduced the movement of physical cash, the opportunity evaporated and the crime largely disappeared.
The War on Drugs and cybersecurity don’t behave that way. They are global and dynamic, with adversaries who adapt to every intervention. They demand continuous management and strategic adaptation, not promises of final eradication.
Both domains operate as markets with durable incentives. Enforcement and defense actions raise operational risk; in illicit markets, that can increase margins, attracting more sophisticated actors—the Hydra effect.
Operations are borderless. Transnational networks span jurisdictions; gray logistics, cryptocurrency rails, and dark-web marketplaces collapse distance and jurisdiction, enabling payments, laundering, procurement, and coordination. In narcotics markets, growing use of cryptocurrencies and dark-web services shows the convergence; in cybercrime, the same rails fund ransomware and broker access.
Adversaries are adaptive and decentralized. Networked cells and affiliate models enable rapid mutation in tactics, techniques, and procedures.
The general population bears the brunt of these illicit economies. This includes the terrible crisis of drug addiction, the devastating impact of violence on civil society, and the massive financial loss and disruption of trust caused by cybercrime and data breaches. Additionally, law enforcement officers and military personnel around the world endure intense danger as they confront sophisticated, well-funded criminal networks. Their dedication comes at a high cost.
The lesson from the War on Drugs is that we must abandon the language of winning a war that is fundamentally systemic and adopt a posture of strategic management and resilience.
In cybersecurity, every early detection, every rapid containment, and every clean recovery is a tactical win that raises the adversary’s cost of doing business. Cybersecurity leaders should emphasize the achievable and sustainable goals of availability and resilience. We may never win this war outright, but we can ensure our vital functions retain availability.

The current speed of the data center industry’s transformation is unlike any in its history. Where infrastructure upgrades once followed multiyear cycles, the pace now is annual, at the speed of consumer electronics. My recent conversation with Kelley Mullick, CEO and founder of Avayla, at the Open Compute Project (OCP) Global Summit in San Jose revealed how liquid cooling has moved from niche application to critical infrastructure component, and why standardization will determine whether the industry can meet the moment.
During our TechArena Data Insights episode with Solidigm’s Jeniece Wnorowski, Kelley shared insights from her extensive career in cooling technologies and her current role as chair of OCP’s industry liaison team. Her perspective illuminates both the technical challenges operators face and the collaborative frameworks emerging to address them.
As data centers continue to evolve in the race to support AI-enabled workloads, Kelley noted that scalability has emerged as the primary challenge facing operators today. During our conversation, she cited an OCP keynote by Meta’s head of infrastructure, Dan Rabinovitsj. What once took multiple years now happens annually, leading him to compare current deployment cadences to consumer electronics rather than traditional data center timelines. “That was a big insight for me,” she said. “I live and breathe in this space, but it is a real insight to make that comparison.”
With this primary challenge of scalability comes a host of secondary challenges, including cooling this infrastructure and preparing for liquid cooling. Before 2022, more than 90% of data centers relied on traditional air cooling. In just two years, liquid cooling adoption has surged to approximately 30% of the market. This rapid acceleration has driven significant growth in coolant distribution units (CDUs), the critical infrastructure components that circulate coolant between the chips and distribution systems. Recognizing this need, at the 2025 global summit, OCP announced a new working group focused on CDU specifications and best practices.
As the industry faces these challenges, collaboration and defining industry standards become more important than ever. At OCP, Kelley is chair of an industry liaison team that connects external standards organizations to OCP. This year, the team had two announcements to make. First, OCP and the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) signed a memorandum of understanding, establishing formal collaboration to eliminate duplicative efforts and create clearer pathways for standards development. In addition, a new liaison role connects OCP directly into ASHRAE’s processes, ensuring thermal management standards align with open infrastructure development. Second, ASTM International launched a new subcommittee on insulating fluids for immersion cooling applications, with a member of the industry liaison team working with that organization as well.
In another example of the importance of collaboration, Kelley highlighted the Universal Quick Disconnect (UQD) specification 2.0, which addresses quick disconnects within the cold plate cooling loop from the chip to the CDU.
“We had many different technologies…There was also a lot of proprietary designs, and this was creating a lot of problems and challenges within the industry,” she said. “Specification 2.0 was one of the first within liquid cooling to come out and really make standards more readily available to address the challenge of interoperability and heterogeneity within the data center.”
Looking ahead, Kelley sees a future that will likely involve hybrid cooling strategies rather than single-solution deployments. While direct-to-chip cooling currently dominates new deployments, Kelley anticipates growing adoption of immersion cooling as workloads demand complete thermal management for entire compute stacks. Cooling networking equipment, memory modules, and storage alongside compute will become increasingly important. Immersion cooling’s ability to capture 100% of generated heat makes it a valuable complement to direct-to-chip solutions, particularly as interest in heat recapture and reuse rises.
As AI-enhanced workloads continue driving unprecedented infrastructure demands, the industry’s ability to standardize rapidly will determine whether operators can deploy at scale while maintaining flexibility for future innovation. Kelley Mullick’s leadership in standards development through OCP demonstrates how open collaboration can accelerate adoption while maintaining interoperability. Organizations that engage with standards bodies today and build relationships across the ecosystem will be best positioned to capitalize on the liquid cooling transition reshaping data center design.
For more information about Avayla, visit avayla.net. To learn about OCP’s Cooling Environments working group and standards development, visit the OCP wiki.

Tejas Chopra builds for scale—and teaches others how. He designed metadata at Datrium, re-architected storage at Box, and now leads ML and storage platforms at Netflix, delivering reliability under pressure. He extends that mission beyond systems, co-founding EnsolAI and GoEB1 to help people put AI to work for growth.
As one of our newest TechArena voices of innovation, Tejas cuts through the hype cycle with a builder’s lens: agent reliability, how to separate research from product, and why quiet, iterative work actually moves the needle. Expect lessons that translate—from billion-dollar infra teams to two-person startups, grounded in real problems, not buzzwords.
A1: I’ve always been drawn to systems that scale. From building metadata storage engines at Datrium to re-architecting storage infrastructure at Box, and now leading machine learning and storage platforms at Netflix, my focus has been on creating reliability at scale. Over time, that curiosity extended beyond large-scale systems into how technology can drive opportunity—which led me to found EnsolAI and GoEB1, both built to help people leverage AI for meaningful professional growth.
A2: Moving from deep systems engineering into entrepreneurship. Building startups taught me a new kind of scalability—not of data or compute, but of people, purpose, and conviction. It forced me to think like both an engineer and a customer—and that ultimately made me a better technologist.
A3: Innovation, for me, used to mean cutting-edge algorithms or new architectures. Today it means solving a real problem in a way that’s repeatable, cost-aware, and human-centric. The best innovations don’t always come from new technology—they come from seeing a familiar problem differently.
A4: We’re overlooking agent reliability—ensuring AI agents act safely, predictably, and accountably. As multi-agent systems become mainstream, frameworks for trust, observability, and control will determine which companies sustain long-term adoption and which don’t.
A5: I ask three questions: Does it remove a pain point or just sound exciting? Can it scale sustainably? And would someone pay for it today? If all three are yes, build it. Otherwise, it’s research—not a product.
A6: That innovation has to look glamorous. In reality, it’s often quiet, iterative, and unglamorous. The real breakthroughs come from people fixing what everyone else tolerates.
A7: Collaborators. AI amplifies creative range but can’t replace intent or taste; the human role shifts from generating to guiding—shaping AI outputs with context, nuance, and moral clarity.
A8: Bridging the gap between technical capability and access. So much potential is locked behind systems, jargon, and privilege. If we can make advanced tools—like AI—accessible and affordable, we democratize innovation itself. That belief drives both EnsolAI and GoEB1.
A9: Business Sutra by Devdutt Pattanaik. It reframed how I think about leadership and innovation—not as rigid hierarchies of control, but as dynamic relationships among purpose, people, and context. It taught me that how we think determines what we build. That lens helps me design systems and teams that are adaptable, not brittle.
A10: I break it down to constraints and first principles. What’s unchangeable? What’s optional? Once you know that, complexity usually reduces to a few core decisions. I write, diagram, and simulate trade-offs until the signal emerges.
A11: Travel. Seeing how people solve everyday problems with limited resources constantly resets my design lens. It’s humbling and practical—both qualities tech needs more of.
A12: TechArena brings together people who care about depth over noise. I’m excited to share learnings from building at Netflix scale and from starting lean, self-funded ventures. I hope readers take away that innovation can happen anywhere—from billion-dollar infra teams to two-person startups—if you focus on solving real, painful problems well.
A13: Pāṇini or Āryabhaṭa. Pāṇini’s precision in defining Sanskrit grammar mirrors the elegance we seek in programming languages today, while Āryabhaṭa’s mathematical imagination still informs how we model the world. I’d ask how they balanced logic and intuition: how they derived universal truths from patterns in language and nature. That balance is what all great technology ultimately strives for.

“More isn’t always more.”
In the competitive landscape of AI infrastructure, conventional wisdom suggests that more resources create better outcomes. But in my recent Fireside Chat with Lisa Spelman, CEO of Cornelis Networks, she argued exactly the opposite, saying strategic constraints and focused execution enable smaller companies to outmaneuver established giants. With Spelman marking her first year as CEO and Cornelis Networks celebrating its fifth year as a company, our conversation provided an opportunity to reflect on how these principles have shaped the company's approach to competing in AI infrastructure.
Spelman, who joined Cornelis Networks after years at Intel Corporation, emphasized that constraints sharpen focus in ways abundant resources cannot. At larger organizations, teams may pursue projects that, while technically sound, don’t directly address the most critical customer problems. In contrast, Cornelis maintains discipline around resource allocation, ensuring every engineer and dollar drives toward solving specific customer challenges in network efficiency and performance.
“Constraints open up creativity and they dial in focus,” she said. “The focus that you can have in a small company allows you to not have resources that wander. It’s not that they’re doing bad work or not focused on good things, but they’re not staying hung in on what is the most important thing for your company to solve your customer’s problem.”
While the company remains “maniacal” in its focus on addressing major challenges in the performance and efficiency of AI and high-performance computing (HPC) applications, Spelman noted that in her own role as the CEO of a smaller company, that work can take many forms. Her days blend vision setting and operational leadership with technical evangelism, which is especially key for a small company with a performant technology competing against entrenched solutions.
Spelman noted that many data center professionals claim immunity to marketing influence. Yet awareness, familiarity, and comfort—building trust—remain essential stages in technology adoption. “We welcome the opportunity to compete on our technical merits,” she said. “But you don’t go from 0 to 60 without making sure you cross off some of those steps of familiarity and comfort with your solution.”
The pace of AI market evolution exacerbates a classic strategic tension for CEOs. Moving too fast risks over-investing in solutions for markets that don’t yet exist, but moving too slow results in irrelevance. In considering this trade-off, Spelman said she errs toward speed, noting that companies can create markets through vision and execution. “Sometimes that’s actually what you need to do,” she said. “It’s not easy, but nothing is easy. It’s not meant to be.”
Cornelis addresses this challenge through intensive customer engagement, using its ecosystem relationships to validate and refine product roadmaps continuously. And the company’s smaller size provides decision-making advantages over larger competitors. Strategic discussions that might require months at enterprise organizations conclude in 30 minutes at Cornelis. This agility allows rapid incorporation of customer feedback without navigating competing priorities and shared resource constraints typical of large companies.
The cultural transformation accompanying this strategic approach extends beyond external positioning. Over the past year, Cornelis evolved from its high-performance computing roots into what Spelman describes as an AI-native organization. This shift encompasses customer engagement models, workload prioritization, and fundamental integration of AI tools throughout operations. The founding team’s early adoption of AI accelerants created infrastructure that enables the company to match market pace.
Spelman reflected on what makes this environment compelling for team members. In a smaller organization, every person’s contribution directly impacts outcomes. “There’s just something about being at a place where every single day, every single person here knows that their work matters,” she said. “I believe that we have a chance to improve the way AI is delivered, used, and consumed. We have a chance to ease the human condition.”
Each team member serves as the expert in their domain, creating mutual accountability between leadership and individual contributors. This structure connects daily work to larger missions around improving AI efficiency and enabling discovery.
Cornelis Networks demonstrates how strategic constraints combined with technical depth can create competitive advantages against larger, established competitors. The company’s focused approach, rapid decision-making, and AI-native culture illustrate that market success in infrastructure depends less on absolute resource levels than on alignment, agility, and deep customer understanding. As AI infrastructure demands continue evolving, organizations that maintain sharp focus while adapting quickly to customer needs will be best positioned to compete regardless of their size.
For more information about Cornelis Networks’ approach to AI networking infrastructure, visit cornelisnetworks.com or follow Cornelis Networks on LinkedIn. The company will be exhibiting at SC25 in November and maintains an active presence at industry events focused on AI infrastructure.

From #OCPSummit25, this Data Insights episode unpacks how RackRenew remanufactures OCP-compliant racks, servers, networking, power, and storage—turning hyperscaler discards into ready-to-deploy capacity.

Open-source AI has quickly evolved from lab experiments to today’s role as an infrastructure backbone of modern enterprise deployments. Taking advantage of this powerful resource can accelerate IT agility, but as with many open-source alternatives, implementing with eyes wide open is critical to deployment success.
I recall working with a customer who prototyped an LLM-based retrieval system using an open model. The experience drove great results in test, but once pushed to production, it fell apart. The result? The customer experienced inconsistent latency, scaling failures, memory pressure, GPU underutilization, and patchy support. While open AI stacks bring advantages of transparency, adaptability, and community velocity, without the right platform foundation, things can go off the rails.
We’ve designed Xeon CPUs with open-source platform support in mind. In fact, a core strength of Xeon CPUs is their compatibility with broad open-source toolchains including TensorFlow, PyTorch, and ONNX, based on over a decade of investment in platform optimization. We have extended that with support for quantized inference, CPU acceleration libraries, and solution portability, helping to reduce friction when deploying open models across hybrid environments.
Of course, that is just a start to what’s needed to ensure an agile platform foundation. Open-source tools often lag in orchestration, monitoring and service management support. Intel and ecosystem partners have invested in tuning orchestration layers and performance libraries like OpenVINO and oneAPI to bridge that gap.
Many leading cloud providers are integrating open-source LLMs natively into their services, accelerating adoption; examples that have gained traction include infima, Mistral, and Llama. In the research community, frameworks like Hugging Face mature weekly, lowering barriers for enterprise adoption. And of course, underlying CPU optimizations including support for BF16 and INT8 drive open model performance higher, making them applicable for a number of AI inference targets in the enterprise.
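To make the INT8 point concrete, here is a minimal sketch of symmetric per-tensor quantization, the general idea behind INT8 inference. The function names and sample weights are hypothetical illustrations, not part of any Intel library or framework API; production toolchains handle calibration and per-channel scales that this sketch ignores.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in one byte instead of four, at the cost of a rounding error bounded by half the scale. That trade is why lower-precision formats can raise inference throughput on hardware that accelerates them.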
To get started with open-source AI, become familiar with framework alternatives and tools available to help implement within your environment. Plan the right infrastructure for your entire AI pipeline, and consider Xeon 6 processors as your CPU foundation, whether for a head node of an accelerated platform or a CPU driven workload where accelerated processing is not required.

A modern storage solution is more than just a place to keep files; it’s a strategic combination of hardware and software designed to manage, protect, and access your most critical asset.
These days, this strategic choice often comes down to two main paths: traditional on-premises storage arrays and first-party, cloud-native services. In this post, we'll break down what each option means, compare their differences, and help you decide which is the right fit for your business.
A storage solution refers to the combination of hardware and software components that are designed to store, manage, protect, and retrieve digital data. At its core, a storage solution includes the physical devices, hard disk drives (HDDs) or solid-state drives (SSDs), that hold your data.
More complex disk arrays aggregate multiple drives to provide higher capacity, improved reliability, and better performance. These disk arrays often support features such as redundancy (RAID), hot-swappable drives, and scalable architectures.
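As a rough illustration of how parity-based redundancy works, the sketch below XORs data blocks into a parity block and then rebuilds a lost block from the survivors. This is a simplified, assumption-laden model of the idea behind parity RAID levels, not how any particular array implements it; the block contents and function name are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per drive.
data_blocks = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild it from the
# surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, any single missing block can be recomputed from the rest, which is what lets an array keep serving data while a failed drive is hot-swapped.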
Beyond the hardware, storage solutions also encompass the software that controls and optimizes the storage environment.
The concept of data storage goes all the way back to 1837 and Charles Babbage’s Analytical Engine. Here’s a timeline:
Storage management software handles provisioning, monitoring, and optimization tasks such as data deduplication and compression. Modern storage arrays add automated tiering, encryption, and integration with cloud platforms for hybrid deployments.
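Deduplication and compression can be sketched in a few lines. The example below is a toy content-addressed store, assuming whole-chunk deduplication keyed by a SHA-256 digest and zlib compression; the class name `DedupStore` and the sample data are hypothetical, and real arrays use far more sophisticated chunking and metadata handling.

```python
import hashlib
import zlib


class DedupStore:
    """Toy store: identical chunks are kept once, and each kept chunk is compressed."""

    def __init__(self):
        self.chunks = {}  # digest -> compressed bytes

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.chunks:  # deduplication: skip chunks already stored
            self.chunks[digest] = zlib.compress(chunk)
        return digest

    def get(self, digest: str) -> bytes:
        return zlib.decompress(self.chunks[digest])


store = DedupStore()
block = b"log line repeated many times\n" * 100
refs = [store.put(block), store.put(block), store.put(b"unique data")]

logical_bytes = len(block) * 2 + len(b"unique data")  # what the writers sent
stored_bytes = sum(len(c) for c in store.chunks.values())  # what is actually kept
```

Comparing `logical_bytes` with `stored_bytes` shows the two savings stacking: the duplicate write costs nothing extra, and the repetitive block shrinks further under compression.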
Modern storage arrays come in various forms: NAS for file sharing, SANs for block-level storage, and object storage for unstructured data. While these systems provide full control over hardware and software, they also require your IT team to handle maintenance, updates, and scaling.
First-party, cloud-native storage is a service developed, managed, and offered directly by a major cloud provider, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP). Instead of buying and managing hardware, the cloud provider handles everything.
Some traditional storage vendors have partnered with hyperscalers to offer their technology as first-party services. Examples include:
This gives you the best of both worlds: trusted storage technology delivered with the convenience of a native cloud service.
How do these approaches compare? The choice depends on your needs for cost, control, and scalability.

Not all storage vendors offer first-party, cloud-native services. Companies like Dell, HP, Hitachi, and Nutanix provide cloud connectivity through different models. These are often called "cloud-integrated" or "third-party" solutions.
Typically, these vendors allow you to run their storage software on public cloud infrastructure. This means you can extend your on-premises environment to the cloud for things like backup or disaster recovery. While this approach offers flexibility, it usually means the storage service is managed by you or the vendor, not by the cloud provider itself. This can add complexity to management and may not offer the same deep integration as a true first-party service.
Deciding between a first-party, cloud-native service and extending an existing array to the cloud depends on your priorities.
First-party, cloud-native storage is often the better choice if you prioritize:
Extending an on-premises array might be a better fit if you need to:
The way you store and manage data is fundamental to your business success. While on-premises arrays offer control, first-party, cloud-native services from providers like AWS, Azure, and Google Cloud deliver scalability and integration.
Both approaches offer distinct advantages, and understanding these differences is key to building a resilient and future-proof storage strategy. For ongoing insights and practical tips on cloud storage and IT infrastructure, be sure to follow my posts on LinkedIn.

DataraAI CTO and co-founder Durgesh Srivastava unpacks a data-loop approach that powers reliable edge inference, captures anomalies, and encodes technician know-how so robots weld, inspect, and recover like seasoned operators.

As AI data centers push into the gigawatt era, cooling is moving to center stage—not just to keep systems within spec, but to enable the next generation of compute.
At the recent AI Infra Summit in Santa Clara, Jeniece Wnorowski and I sat down with Scott Twomey, Senior Director of Global Business Development at Flex, for a Data Insights conversation on how the company is scaling advanced cooling technologies for the AI era.
Flex is best known as a global manufacturing powerhouse, operating more than 100 locations across 30 countries. With 17 U.S. facilities spanning seven million square feet and an additional nine million in Mexico, Flex also has one of the largest advanced manufacturing footprints in North America. In 2024, it acquired JETCOOL Technologies Inc., a leader in advanced thermal management solutions. Together, the teams are integrating JetCool’s patented microconvective cooling technology into Flex’s global manufacturing and data center infrastructure. Combined with Flex’s comprehensive compute and power solutions, including vertical power delivery, they deliver a vertically integrated approach to power, cooling, and rack infrastructure—streamlining deployment for hyperscale AI systems.
As part of Flex, JetCool specializes in single-phase, direct-to-chip cooling technology that uses an array of microjets to target processor hot spots directly, delivering a significantly lower thermal resistance than conventional microchannel cooling. This technology is now integrated into Flex’s server reference platforms, improving heat transfer, minimizing heat gradients, and stabilizing CPU temperatures across complex silicon architectures and IT stacks. The outcome is faster time-to-market, higher performing chipsets, and better thermal management for the industry’s most demanding workloads.
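Why thermal resistance matters can be shown with the standard steady-state relation: temperature rise equals power multiplied by thermal resistance. The sketch below applies it with made-up numbers; the coolant temperature, power level, and both resistance values are hypothetical placeholders chosen for easy arithmetic, not JetCool or Flex measurements.

```python
def case_temperature_c(coolant_temp_c, power_w, thermal_resistance_c_per_w):
    """Steady-state estimate: T_case = T_coolant + P * R_th."""
    return coolant_temp_c + power_w * thermal_resistance_c_per_w


# Hypothetical inputs for illustration only (not vendor data).
COOLANT_TEMP_C = 30.0
POWER_W = 1000.0
BASELINE_R_TH = 0.04  # degrees C per watt, assumed higher-resistance cold plate
LOWER_R_TH = 0.02     # degrees C per watt, assumed lower-resistance cold plate

baseline_temp = case_temperature_c(COOLANT_TEMP_C, POWER_W, BASELINE_R_TH)
lower_temp = case_temperature_c(COOLANT_TEMP_C, POWER_W, LOWER_R_TH)
headroom_gained_c = baseline_temp - lower_temp
```

Under these assumed values, halving the thermal resistance halves the temperature rise above the coolant. The same headroom can instead be spent on higher processor power or warmer coolant.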
Twomey shared how Flex is helping scale JetCool’s production globally, leveraging its advanced manufacturing footprint and expertise to bring liquid cooling solutions to market faster. Together, they’re expanding their cooling portfolio and building a unified thermal roadmap—from SmartPlate cold plates and embedded semiconductor cooling to vertically integrated liquid-cooled racks—to meet both current and future AI cooling demands.
As JetCool builds out its liquid cooling ecosystem, the company is evolving into a comprehensive, end-to-end liquid cooling provider, streamlining deployments through a single partner. Backed by Flex’s global manufacturing infrastructure, JetCool is expanding its portfolio to support deployments from the die level to the rack, row, and facility systems. This broader offering flattens supply chains, reduces vendor integration friction, and enables easier customization of modular liquid cooling solutions. The current product line includes cold plates, coolant distribution units, manifolds, and quick disconnects, with more in development for 2026. With Flex’s support, JetCool is positioned to deliver scalable, integrated cooling solutions that meet the evolving demands of AI infrastructure.
Rising power density and thermal loads are now a structural trend, not a spike. The response must be continuous engineering—greater thermal capacity, faster design cycles, tightly integrated solutions, and architectures that scale generation over generation. Cooling can’t be an afterthought; it must advance in lockstep with processor TDP, with built-in headroom for future Superchips.
“One of the many reasons that we looked at JetCool from an acquisition standpoint is that they did have a portfolio of solutions to address the here and the now as well as the future,” Twomey said.
JetCool’s strategy spans three tiers to match rising processor thermal design power (TDP).
SmartPlate is a fully sealed liquid cold plate designed for specific processor families, cooling over 4,000 watts per socket.
SmartLid extends headroom by removing both thermal interface layers to directly route fluid to the processor lid, cooling over 5,000 watts per socket, preparing for the next wave of ultra-dense accelerators.
SmartSilicon embeds JetCool’s microjet array directly into the silicon substrate—an approach that requires tight collaboration with end customers, chipmakers, and foundry partners.
Together, these solutions give customers a clear path from current high-TDP processors to tomorrow’s even denser AI hardware.
The gigawatt era is redefining what’s possible in AI data center cooling. As GPU power envelopes climb toward 5,000 watts and beyond, incremental improvements won’t cut it. Cooling is becoming a strategic enabler, not a support function.
JetCool and Flex’s roadmap—from SmartPlates to SmartSilicon—reflects the kind of multi-layered innovation and manufacturing scale the industry will need. By combining Flex’s global manufacturing muscle with integrated solution design, JetCool is positioning itself as a key player in scaling high-density, AI-optimized data centers.
*Microconvective cooling, SmartPlate and SmartLid are JetCool trademarks.

At the recent OCP Global Summit in San Jose, I sat down with Carl Schlachte, CEO of Ventiva, to talk about something that sounds counterintuitive at first blush: what five years of grinding on laptop thermal design can teach hyperscale and enterprise data centers. The short answer, in Schlachte’s telling, is “a lot,” and soon.
Ventiva has been heads-down in one of the harshest thermal environments outside a rack: thin, sealed consumer and commercial laptops where millimeters matter, acoustics are unforgiving, and reliability thresholds are brutal. Schlachte says that discipline—solving for tight envelopes, variable duty cycles, and field reliability—translated cleanly to servers and accelerators once the right people took notice. That notice didn’t come through a cold pitch; it came laterally. Some of the same firms that collaborate with Ventiva on next-gen laptops also have server and data-center teams.
That “reference sold” path—laptop counterparts vouching Ventiva into server and facility groups—matters for two reasons. First, it shortens the confidence cycle when a new thermal approach shows up in a risk-averse environment. Second, it implies the solution isn’t a bespoke one-off for a single chassis; it’s a design pattern hardened by millions of laptop hours that can be application-engineered into many form factors.
Schlachte also hinted at timing. Ventiva is preparing announcements around CES—framed as “groundbreaking” systems that, in his words, “change the nature of what a laptop is.” While details are under wraps, the more interesting part for data-center buyers is what he claims won’t be necessary to port the tech into servers: net-new R&D. The building blocks are already validated for lifetime and scale in a tougher mechanical envelope. What remains is application engineering—integrating into the physical realities of 1U/2U servers, dense accelerators, and varied sleds, and aligning with rack-level airflow and power designs.
Why would laptop learnings carry weight in a 600 kW row? Constraints rhyme. In both spaces, thermal budgets are tight and rising, hotspots shift under dynamic workloads, and acoustics or vibration can’t become a side effect. Reliability is non-negotiable. In laptops, the penalty for errors shows up as throttling or returns; in AI racks, it’s stranded GPUs, erratic performance, and higher TCO. Techniques that squeeze higher heat flux out of compact geometries—whether through novel heat spreading, phase-change management, or smarter flow control—map well to constrained server envelopes and to edge locations where facility retrofits aren’t feasible.
The OCP Summit context matters here. Over the past 18 months, the industry has been pivoting from server-first to rack- and multi-rack-first thinking. As power densities spike and liquid cooling proliferates, the battleground has moved to materials, manifolds, safety regimes, and serviceability in brownfield realities. Ventiva’s message: there’s still real gain to be had inside the box—at the component and sled levels, especially by reusing tactics proven in tight-tolerance laptop designs. That doesn’t replace facility-level innovation; it complements it by reducing the thermal tax inside each box.
Schlachte describes the reception at OCP as “amazingly good,” and that tracks with what we heard on the showroom floor: operators want both macro and micro levers. On Monday, teams modeled coolant loops; on Tuesday, they fought a stubborn NIC hotspot and the fan curves needed to keep a CPU in bounds while a GPU surged. Even a few percent more stable performance per server—or holding the same acoustic or power profile at higher load—added up fast at scale.
There’s also a deployment story embedded here. If the core technology ships in laptops first, the supply chain, QA, and lifetime data will ramp quickly. For data-center adopters, that can de-risk qualification, shorten pilot cycles, and improve spares forecasting. The open question is the integration path: which OEMs and ODMs pick this up, and how fast do they tune sled designs to exploit it? Schlachte frames Ventiva’s next step as heavy application engineering—helping partners adapt form factors and operational playbooks without forcing a full mechanical redesign.
For operators, the practical questions are straightforward. What is the delta on junction temperatures at given loads? How does the solution behave under bursty AI inference vs. sustained training? What’s the impact on acoustics, airflow directionality, and contamination risk? And crucially, how does it coexist with emerging liquid strategies—direct-to-chip, cold plates, or hybrid air/liquid racks? Schlachte suggests it’s not either/or; it’s making the box smarter so that rack-level choices deliver more consistent returns.
We like the vector here: translate hard-won laptop thermal tricks into compact, serviceable gains at the server and edge. The go-to-market signal—being ushered into data-center teams by adjacent laptop engineers—cuts through typical skepticism and hints at broad applicability. That said, the data-center bar is high. To win trust, Ventiva should publish clear, apples-to-apples results: sustained workload deltas, hotspot mitigation under mixed CPU/GPU loads, acoustic and power impacts, and field maintainability. Even better, show coexistence with standards-based server designs and liquid-cooling topologies in the wild.
Net: the demand is here. Land a few lighthouse deployments with OEM/ODM partners, document coexistence with standards-based components, and ship pragmatic integration guides. Do that, and Ventiva’s differentiation becomes a de-risked choice for operators who need every watt and every degree back in the AI era.

From CPU orchestration to scaling efficiency in networks, leaders reveal how to assess your use case, leverage existing infrastructure, and productize AI instead of just experimenting.

NVIDIA brought its GTC event to Washington, D.C. for a reason.
Spanning three days at the Walter E. Washington Convention Center, the event targeted policymakers, integrators, and program leaders deciding where national-scale AI capacity will live and how it will be governed.
The keynote message, delivered today by Jensen Huang, landed clearly: treat AI as an industrial system, not a server purchase. In practice, that means Department of Energy (DOE) supercomputers, quantum-classical coupling, AI-infused radio access networks, autonomy at fleet scale, and a drumbeat on U.S. manufacturing.
The headline announcement centered on the DOE. Argonne National Laboratory will stand up two new AI systems—Solstice at roughly 100,000 Blackwell GPUs and Equinox at about 10,000—both targeted for the first half of 2026 and tied together with NVIDIA’s networking stack. Oracle is the prime hyperscale partner on the larger system. The subtext is supply and cadence: NVIDIA guided to an eye-popping bookings run rate, reinforcing that Blackwell-class capacity will be allocated, not casually procured. For public-sector programs and regulated industries, planning windows now start with guaranteed delivery of GPUs, interconnect, racks, and liquid cooling in the same contracting cycle.
RAN is the linchpin of AI at the edge, and NVIDIA has been pressing this front for roughly three years. The Nokia alignment doubles down on an AI-RAN path that moves inference and optimization into the radio stack itself for latency, efficiency, and fleet-level control.
Beyond speeds-and-feeds, this is about industrial policy: rebuilding leadership in critical infrastructure by composability across RAN silicon, GPU acceleration, and software. For carriers and federal networks, the takeaway is that AI will live at the edge as much as in the region, and procurement will increasingly reward end-to-end blueprints over stitched one-offs.
The Nokia play makes that third leg—edge—explicit, carrying the same AI toolchain out to radios and cell sites. If you want performant AI at the edge, you have to start with the RAN.
Quantum computing moved from slideware to an integration story. NVQLink is NVIDIA’s architecture to couple GPUs with early-stage quantum processors so error correction, classical pre/post-processing, and AI-driven orchestration can sit close to QPUs. Dozens of partners—from lab programs to vendors like IonQ and Rigetti—give the idea immediate surface area. The pragmatic read for near-term users is straightforward: hybrid quantum-classical workflows can accelerate today, long before fault-tolerant machines arrive, provided the links are tight and the toolchains are familiar.
Autonomy returned to the roadmap with scale. NVIDIA and Uber set a target to field an autonomous fleet on the order of 100,000 vehicles starting in 2027, framed as an AI data-factory problem as much as a sensor stack. On the vehicle side, NVIDIA’s DRIVE platform continues to broaden its bench with Stellantis, Lucid, and Mercedes-Benz in the fold. The message is consistency: ingest, simulate, retrain, and redeploy in tight loops—exactly the “factory” model NVIDIA wants buyers to internalize.
Since 2020, onshore manufacturing has been table stakes—not a new pivot. What’s changing now is its weight in RFP scoring across this decade: locality, sovereignty, and supply assurance sit alongside performance-per-watt. Jensen Huang’s emphasis on U.S. milestones for Blackwell and new assembly footprints (Arizona, Houston) signals that “where” and “how” you build will remain a first-order decision throughout the decade.
Rather than just supplying connective tissue, Google is clearly moving to monetize its AI stack. Blackwell-based instances on Google Cloud pair with an on-prem path via Google Distributed Cloud running Gemini on Blackwell systems. The pitch is commercial, not merely architectural: one toolchain, multiple SKUs, and consumption paths that let buyers pay for capability where it runs best.
This isn’t either-or. It’s yes-and: burst to cloud, anchor sensitive work on-prem, and, increasingly, extend the same models and MLOps to the edge.
Synopsys added a concrete proof point that “AI + accelerated compute” collapses engineering schedules. NVIDIA is piloting Synopsys AgentEngineer for AI-enabled formal verification integrated with the NeMo Agent Toolkit and Nemotron open models, an early signal that agentic workflows are entering signoff. On the simulation side, Synopsys highlighted dramatic gains: claims of lightning-fast computational fluid dynamics in Ansys Fluent with GPU acceleration and AI initialization, and up-to-15× speedups for QuantumATK atomistic simulations on CUDA-X and Blackwell. A defense electronics customer cited jobs dropping from weeks to hours. Those numbers, even if workload-dependent, are exactly what program managers want to hear when timelines and budgets are under pressure.
Deployment is now a three-part system: cloud for elasticity, on-prem for control, and edge for immediacy. The Nokia RAN work is the connective tissue that makes the edge leg viable at scale.
Call it what it is—an operating plan for national-scale AI. NVIDIA framed AI as an industrial system across labs, networks, vehicles, and factories, and positioned itself to supply the muscle, the middleware, and the maps.
DOE wins plus Nokia and Uber partnerships reinforce one theme: assemble end-to-end AI factories and simplify the buy. Synopsys’ gains suggest the next bottleneck moves to orchestration, data pipelines, and power as verification agents and GPU-accelerated physics compress schedules.
This was an assertion of scale at the very moment scale is contested. The partnerships and roadmaps are real, but so are the political and community headwinds around AI factories. If GTC DC shifts anything, it’s the center of gravity of the debate: from “can we build it?” to “where, how, and on whose terms?”