TechArena Forum | AI, Cloud & Innovation Community Discussions

Dec 8, 2025

From Operators to Innovators: Inside Midas Immersion Cooling

The data center industry stands at an inflection point. As AI-enabled workloads drive compute densities beyond 100 kilowatts per rack, traditional air cooling approaches are reaching their limits. My recent conversation with Solidigm’s Jeniece Wnorowski and Scott Sickmiller, CEO of Midas, revealed how immersion cooling technology has evolved into a practical solution for today’s most demanding workloads.

What makes Midas’s perspective particularly valuable is their origin story. Unlike companies that developed immersion cooling as a product, Midas became a provider because they were first a user facing real cooling challenges in their Austin data center.

From Necessity to Innovation

Midas began as a data center operation in 2011, quickly becoming the go-to provider for hard-to-cool IT infrastructure. The growth trajectory forced them to look beyond traditional air cooling solutions. Between 2011 and 2012, the team iterated through multiple immersion cooling designs, ultimately developing and patenting their own solution. In 2016, they made the decision to exit the data center business and focus exclusively on providing immersion cooling infrastructure to the industry. “And the rest, as they say, is history of 4,000 tanks,” Scott said.

This user-first development approach shapes everything about Midas’s technology today. As Scott explained, having to maintain the systems themselves drove design decisions toward user-friendliness and operational efficiency that competitors who never operated the technology might overlook.

The Physics of Immersion

At its core, immersion cooling leverages a simple advantage: liquids dissipate heat approximately 1,200 times more effectively than air. By submerging IT equipment, data centers immediately gain this thermal efficiency advantage. However, as Scott emphasized, doing immersion well requires more than just dunking servers in liquid.

Early on, the team learned that success depended on computational fluid dynamics (CFDs). CFDs are critical to ensuring that the dielectric liquid reaches all heat sources, engages with them, and moves away from them with a uniform flow. CFDs are crucial to ensuring that this happens no matter the rack’s form factor. While adapting to diverse hardware designs is a challenge, Scott noted, “At the end of the day, it’s only physics. So the physics can support the workload. We just have to fit the form factor into the physics box.”

The Thermal Recovery Advantage

Beyond raw cooling efficiency, immersion cooling enables thermal energy recovery in ways air-cooled systems cannot match. The dielectric fluid not only captures heat more effectively than air, it also retains that heat longer, enabling efficient transfer to other systems.

Scott shared an example from a recent meeting with a German district heating facility. In district heating, water or another fluid is centrally heated and then pumped out into a distribution network, eventually reaching buildings where it regulates temperature through boilers. When a data center can provide water at 50° Celsius (122° Fahrenheit), this represents a significant opportunity to reuse energy already consumed for computing. The economics are compelling. “We’ve already paid for the energy once,” Scott said. “So at that point, why not use it again? And that’s where thermal recovery is really useful.”

Practical Deployment Considerations

Immersion cooling shows strong return on investment above 40 kilowatts per rack, and the technology becomes necessary at 100 kilowatts and beyond. As advancement of graphics processing units (GPUs) drives power densities higher, direct liquid cooling alone cannot solve the challenge. Peripheral components still generate heat requiring air cooling, straining facility infrastructure as power becomes the ultimate constraint.

The barrier that Midas faced for 15 years—data center operators’ resistance to liquid near equipment—has been addressed as organizations adopt rear-door heat exchangers and direct-to-chip cooling. “Many of the data centers, especially the ones that are focusing on machine learning and AI, are building water loops in the facility,” Scott said. “So that prerequisite is done. Then we need to start looking at the IT.”

The IT requirements are “quite a bit different.” One of the biggest changes? Fans are no longer needed, and the immediate benefit is significant. A one-kilowatt server that dedicates 150 to 200 watts to fans can complete the same compute at just 800 watts immersed.

The Midas Difference

What distinguishes Midas in an increasingly competitive market comes back to their operational heritage. Scott highlighted their truly concurrent maintainability and fault-tolerant design, which includes redundant cooling distribution units (CDUs) as standard. The system supports easily hot-swapping failed CDUs: with just an hour of education, the company’s global sales manager learned to hot-swap a CDU in seven minutes. The operational simplicity extends to deployment, as well. Scott described installing a system at a university in the United Kingdom in 40 minutes. “That’s an advantage of a Midas,” Scott said. “We had to maintain it ourselves, so we built it that way.”

TechArena Take

Midas’s journey from data center operator to immersion cooling provider demonstrates how real operational experience drives practical innovation. Their emphasis on user-friendly design addresses the large and small daily challenges that data center operators face. As compute densities continue climbing and power constraints tighten, immersion cooling is transitioning from alternative technology to essential infrastructure. Companies like Midas, with proven deployment experience and field-tested designs, are well-positioned to lead this transformation.

Learn more about Midas immersion cooling solutions at www.midasimmersion.com.

Particles creating the healthcare symbol

AI, the Silent Partner Mitigating Clinician Burnout

Burnout isn’t just a trendy term; it’s a real crisis. Doctors and nurses are feeling the weight of unprecedented stress, fatigue, and emotional exhaustion. With overwhelming administrative tasks, endless paperwork, and the constant pressure to provide top-notch care in a shorter time frame, the environment has become unsustainable. The outcome? Burnout rates soaring above 50% in certain specialties, which is leading to workforce shortages and putting patient care at risk.

Enter Artificial Intelligence (AI), a tool that can act as a supportive partner working quietly in the background to help restore balance.

The Burnout Epidemic

Healthcare professionals often find themselves spending almost half of their day on administrative tasks instead of focusing on patient care. While Electronic Health Records (EHRs) are crucial, they can also be a major source of frustration due to their complexity and the time they require. The issue of burnout doesn’t just impact the providers; it sends shockwaves throughout the entire system, affecting patient satisfaction, safety, and even the financial health of the organization.

AI as a Workflow Optimizer

One of the most impactful ways AI is changing the game right now is by taking over those tedious tasks that can really drain provider time. Smart systems are stepping in to handle things like scheduling appointments, verifying insurance, and even managing prior authorizations—jobs that used to consume hours of clinicians’ time. Then there are the more sophisticated tools, like ambient clinical intelligence, which can listen in during patient visits and automatically generate structured notes. This means healthcare providers can finally break free from the never-ending cycle of typing.

Imagine this: a provider finishes a consultation and the documentation is already taken care of—accurate, compliant, and ready for a quick glance. It might sound like something out of a sci-fi flick, but it’s happening right now.

Clinical Decision Support: Reducing Cognitive Load

Burnout isn’t just about the endless paperwork; it’s also tied to decision fatigue. Clinicians are constantly juggling a mountain of data, from lab results to imaging studies. That’s where AI-powered clinical decision support tools come in. They sift through all this information in real time, bringing forward actionable insights and highlighting potential risks. Instead of feeling overwhelmed by data, healthcare providers receive clear, evidence-based recommendations.

This doesn’t take the place of clinical judgment; it enhances it. By lightening the cognitive load, AI gives clinicians the freedom to concentrate on what truly matters: connecting with patients and showing empathy.

Real-World Examples

Ambient Documentation: Tools like Nuance’s Dragon Ambient eXperience (DAX) are already transforming workflows by capturing conversations and generating notes automatically.

Predictive Scheduling: AI models forecast patient flow and staffing needs, helping hospitals allocate resources efficiently and prevent overload.

Smart Triage: AI-driven triage systems prioritize cases based on urgency, ensuring critical patients receive timely care.

The Human-AI Partnership

AI can’t take the place of empathy, intuition, or that special human touch. What it can do is create an environment where those qualities can flourish. By handling administrative tasks and simplifying decision-making, AI allows clinicians to reclaim their most asset: time.

Future Outlook: AI as a Resilience Tool

As healthcare evolves, AI will become a cornerstone of provider well-being strategies. Beyond automation, expect predictive burnout analytics, systems that monitor workload patterns and flag early signs of stress, enabling proactive interventions.

By reducing administrative friction and cognitive overload, AI empowers clinicians to reconnect with their purpose: caring for patients. The future of healthcare isn’t man versus machine it’s man and machine, working together to restore balance and resilience.

Valvoline Global Operations Engineers Data Center Liquid Cooling

For more than 150 years, Valvoline has been synonymous with high-performance motor oil and racing heritage. Now, the company is applying its expertise to a very different kind of performance challenge: keeping AI data centers cool as they transition from the megawatt era into the gigawatt era.

In a recent conversation with Michael Morrison, director of new ventures at Valvoline Global Operations, and Solidigm’s Jeniece Wnorowski, I discussed how data center cooling represents a natural evolution for a company built on managing heat and performance. As Michael explained, Valvoline has actually maintained a data center presence for years, providing oils for backup power generation systems. The move into cooling solutions represents a deeper engagement with an industry facing unprecedented thermal challenges.

Two Liquid Cooling Solutions to Meet the Density Challenge

While rising temperatures grab headlines, Michael emphasized that density is the real challenge facing modern data centers. AI-enhanced workloads require packing more chips into the same physical space, creating concentrated heat loads that traditional air cooling cannot effectively manage. Liquid cooling enables increased density of chips per server and of servers per data center, fundamentally changing the economics of AI infrastructure deployment.

Two approaches to liquid cooling have arisen in response to this challenge: direct-to-chip cooling, and immersion cooling. Direct-to-chip cooling runs coolant through lines and cold plates to cool individual processors, and it has already moved beyond adoption into rapid growth. Major manufacturers have begun supporting this approach, and deployments have begun.

Immersion cooling, however, remains in earlier stages. In immersion cooling applications, entire servers are submerged in tanks filled with dielectric fluid. The approach allows heat to be captured from all components simultaneously. It also represents a large potential change for hyperscalers, which explains why it is still largely in proof-of-concept phase.

“They’re not used to having large open tanks sitting in their data centers,” Michael said. “So, they’re not only testing performance metrics, but understanding, ‘what is my maintenance on a server like?’ All of those things have to have operational procedure set: all the nuances of running it in a normal setting and an emergency environment.”

Precision Testing and Partnership: Engineering Compatibility at Every Level

The key to immersion cooling is dielectric oils. Dielectric materials, like the oils Valvoline produces, are nonconductive substances for electric currents. And finding a fluid with ideal properties to enable high performance is where Valvoline Global shines.

“We’re used to testing properties in our fluids that would determine, does it conduct electricity? Does it transfer heat?” he said.

While Valvoline Global’s fluid testing capabilities form the foundation, Michael emphasized that deploying these solutions successfully requires a more comprehensive approach. When servers are immersed in dielectric oil, compatibility becomes critical across thousands of individual components. Valvoline Global works closely with data center operators to ensure their fluids are compatible with specific hardware configurations, tank materials, and operational requirements. This collaborative approach, which the company has refined over more than 150 years of customer relationships, distinguishes their market strategy from simple product provision.

Sustainability Through Operational Efficiency

Beyond performance, liquid cooling addresses sustainability concerns that are becoming critical for data center operators. By reducing or eliminating large HVAC systems required for air cooling, facilities can significantly decrease power consumption and operational expenses. Water usage can also be reduced depending on system configuration. Michael noted that liquid cooling can creates a scenario where improved cost structure and reduced environmental impact work together rather than act as competing priorities.

The TechArena Take

Valvoline Global’s entry into data center cooling represents more than a company diversifying its product portfolio. It reflects how foundational technologies from established industries are being reimagined to solve the infrastructure challenges of AI deployment. As data centers grapple with the thermal and density challenges of AI-enabled workloads, Valvoline Global’s’s combination of fluid science expertise, collaborative approach, and long history of managing high-performance applications positions them as a meaningful player in this infrastructure evolution. For organizations planning liquid cooling deployments, the lesson is clear: success depends not just on the technology itself but on the partnerships and compatibility testing that ensure reliable, long-term operation.

Learn more about Valvoline Global’s data center cooling solutions at their website, valvoline.com, where they provide detailed technical resources on liquid cooling technologies. Connect with Michael Morrison on LinkedIn to continue the conversation about thermal management innovation.

Equinix on Architecting the AI-Ready Data Center

Inside Equinix and Solidigm’s playbook for turning data centers into adaptive, AI-ready platforms that balance sovereignty, performance, efficiency, and sustainability across hybrid multicloud.

6 Plays to Close the AI-Era Connectivity Gap

The foundation of the digital economy is buckling under the weight of its own success. artificial intelligence (AI) inference, real-time autonomous systems, and the explosion of edge computing are driving network demand far beyond what today’s infrastructure was designed to support.

This pressure is creating a pervasive state of digital asymmetry. The problem is no longer a simple binary of “connected” versus “unconnected.” Instead, it shows up as a spectrum of gaps in coverage, consistency, and resilience that threaten the promise of real-time, AI-driven services.

This playbook lays out the key principles and deployment patterns needed to close that gap with a converged, “all of the above” architecture that uses fiber, wireless, satellite, and free space optics (FSO) together instead of pitting them against each other.

Play 1: Redefine the Problem

Digital asymmetry describes the widening mismatch between where demand for high-quality connectivity is exploding and where networks can realistically deliver it. It manifests in three distinct, overlapping gaps.

The Unserved Gap: Roughly 2.6 billion people still lack reliable connectivity, often living in areas where fiber is fiscally impossible to deploy. In these regions, the economics of trenching long distances for a small number of subscribers simply don’t pencil out.

The Under-Served Gap: Millions more experience wildly inconsistent broadband. Service quality can vary street by street, constrained by aging copper, congested last-mile technologies, or a legacy plant that was never designed for high-throughput, low-latency AI-era workloads.

The Reliability Gap: Even in hyper-connected cities, capacity shortfalls and service outages are forcing enterprises and autonomous systems to rethink operations. Simple fiber cuts, severe weather, or temporary environmental interference can break the “real-time” promise that AI applications depend on.

The first shift in mindset is to stop thinking in terms of “connected vs unconnected” and start thinking about where and how these three gaps show up in your footprint.

Play 2: Admit Fiber Is Essential, but Not Sufficient

Fiber optic cable is the undisputed gold standard for modern broadband: high capacity (100 Gbps+), ultra-low latency (< 5 ms), and decades-long reliability. When it can be deployed economically, it is often the first and best choice.

But physics, time, and money place hard limits on what fiber can solve on its own.

Deployment timeline: Fiber projects are fundamentally linear and slow. Typical builds can take 12–18 months from planning to activation, and the bottlenecks are rarely technical. Permits, street closures for trenching, utility coordination, environmental reviews, and complex right-of-way negotiations can stall a single mile of deployment for half a year or more. Fiber scales linearly in a world where demand is growing exponentially.

Unfavorable economics: The cost of construction alone makes fiber infeasible in many regions. Urban builds often cost $30,000–$50,000 per mile. Rural deployments, where trenching crosses longer distances and serves fewer customers, can exceed $100,000 per mile. Extending connectivity into sparsely populated regions demands heavy capital investment, and the business case rarely works without substantial government subsidies.

Geography: Fiber requires a continuous physical path. Mountains, rivers, highways, rail crossings, and protected lands are not just obstacles; they are hard chokepoints that add months and millions to construction budgets. In many parts of sub-Saharan Africa, Southeast Asia, and rural America, avoiding these barriers is simply not practical.

Global funding doesn’t erase these constraints. The World Bank estimates that closing the global connectivity gap with fiber alone would cost more than $1 trillion and take decades. Policymakers have started to acknowledge this reality. The U.S. government’s $42 billion Broadband Equity, Access, and Deployment (BEAD) program, historically fiber-focused, is now open to high-performance wireless and satellite alternatives.

Fiber is therefore essential, but not sufficient. Even with aggressive funding, it cannot close every capacity and reliability gap on its own.

Play 3: Use Each Medium Where It Wins

If fiber is the backbone, the next step is to treat every other transport medium as a specialist, not a generalist. The goal is to use each technology where its physical and economic profile is strongest.

Fiber Strengths

Unmatched capacity per strand (100+ Gbps)

Ultra-low latency

Immunity to weather

Multi-decade lifespan

Fiber Weaknesses

Slow, permit-bound deployment

High cost per mile

Vulnerable to geographic and right-of-way constraints

Fiber is best used for dense urban cores, data center interconnects, and backbone routes where capacity and long-term value justify the investment.

Radio Frequency Wireless Strengths

Fixed wireless access (FWA) is booming, delivering rapid last-mile access to millions of homes and businesses.

RF systems are mobile-friendly, deploy in weeks to a few months, and benefit from a mature ecosystem that includes high-capacity backhaul (for example, E-band or microwave up to 10 Gbps).

RF Wireless Weaknesses

RF spectrum is finite, costly, and congested in dense markets.

Capacity per sector is limited, and densification through small cells (often $50K+ per site) requires robust backhaul.

RF wireless is best used for suburban and rural access, mobile coverage, and as a flexible complement to fiber for last-mile connectivity.

Free-Space Optics (FSO) Strengths

FSO delivers fiber-class capacity (10–100 Gbps) and low latency (< 5 ms) without trenching or licensed spectrum.

It can be deployed in days, at a fraction of the cost of equivalent fiber, and modern systems use forward error correction (FEC) to smooth minor interruptions from birds, dust, or other transient obstructions.

FSO Weaknesses

Requires line-of-sight; buildings, trees, and terrain can block signals.

Performance degrades in dense fog or heavy rain, which is why FSO is often paired with a backup RF link for hybrid availability.

FSO is best used for urban backhaul where trenching is impossible or prohibitively expensive, short-span “fiber gap” bridges, and enterprise sites with clear line-of-sight and a secondary path for redundancy.

Here’s a real-world example: In Lagos, Nigeria, operator MainOne used FSO to connect 20 enterprise buildings in three months—a project that would have taken roughly 18 months and cost about five times more using fiber alone. The FSO links deliver 10 Gbps with 99.9% uptime, and the approach is now being extended to residential areas.

Satellite Low Earth Orbit (LEO) Strengths

LEO constellations can deliver immediate coverage almost anywhere on Earth, without terrestrial build-out.

They are ideal for remote regions, maritime and aviation connectivity, and rapid deployment in disaster zones.

Satellite (LEO) Weaknesses

Higher latency than terrestrial fiber (typically 20–40 ms), limited per-user capacity (often 50–200 Mbps), and higher cost per Mbps (for example, $100–$200 per month for residential plans).

As subscriber density increases, congestion risk grows.

Satellite LEO is best used for: Remote and rural regions with no viable terrestrial options, backup connectivity for critical infrastructure, and mobile platforms such as ships, planes, and vehicles.

The point is not to crown a new winner. It is to match each medium to the situations where it delivers the best combined outcome on speed, cost, and reliability.

Play 4: Design Hybrid

The real gains come when you design networks as hybrid from the start, instead of treating non-fiber technologies as temporary workarounds. Optimal Hybrid Placement means planning fiber, RF, FSO, and satellite together, assigning each to the roles where they are physically and economically strongest.

Consider an illustrative scenario from rural Montana. A regional internet service provider (ISP) needed to connect 5,000 homes across 200 square miles of mountainous terrain. A fiber-only design was estimated at $80 million and a four-year timeline.

Instead, the ISP built a hybrid network:

Satellite (for example, Starlink) provided immediate coverage for the most remote 20 percent of homes—about 1,000 subscribers.

Fixed wireless (Citizens Broadband Radio Service) served suburban clusters of roughly 3,000 subscribers, delivering 100–300 Mbps service.

FSO links provided multi-gigabit backhaul between wireless tower sites and existing fiber points of presence, avoiding more than 40 miles of difficult trenching.

Fiber connected the overall network to the regional backbone at two strategic points.

The results were decisive: the network could launch in about nine months, at a cost of $32 million—roughly 60 percent less than the fiber-only design and about four times faster to deploy. Average subscriber speeds were approximately 200 Mbps.

Hybrid in this context is not a compromise. It is the only approach that can simultaneously hit the necessary targets for speed, cost, and coverage across challenging geographies.

Play 5: Tame the Operational Offset

There is, however, a tradeoff. Hybrid architectures lower upfront capital costs but drive up operational complexity. That operational offset is the real barrier to wide-scale adoption.

Running four distinct platforms—fiber, RF, FSO, and satellite—means managing different vendors, different skill sets, and more complex provisioning and monitoring. Orchestrating seamless handoffs between dissimilar technologies, while maintaining session continuity and quality of experience, adds real operational risk.

Even where the technology and economics are well understood, three systemic factors are slowing hybrid adoption:

Operator inertia: Many ISPs and telcos are built around fiber-first strategies. Procurement processes, vendor relationships, and engineering organizations have been optimized for fiber deployments for years. Shifting to hybrid requires retraining teams, rethinking network design, and in some cases revisiting business models.

Standards and interoperability: Hybrid networks need to support continuous sessions that may move from fiber to FSO and then fail over to RF or satellite without breaking the application. Industry standards and orchestration frameworks for multi-technology handoff are still maturing, and operators are understandably cautious about stitching together complex, multi-vendor stacks.

Financing and policy models: Government subsidies—such as BEAD in the U.S.—have historically been written in technology-specific terms. Grants may favor “deploy fiber to this area” rather than “deliver 100 Mbps service to these households by any reliable means.” Hybrid approaches do not always fit cleanly into these rules, which can discourage operators from pursuing them even when they make technical and economic sense.

At this point, the bottleneck is less about whether hybrid can work, and more about whether operators, vendors, and regulators can align operational models and policy frameworks to make it manageable at scale.

Play 6: Build for Dissimilar Redundancy and AI-Era Resilience

The final play is to design not just for coverage and capacity, but for AI-era resilience. As AI, autonomous vehicles, and distributed industrial internet of things (IoT) systems demand near-perfect uptime, traditional notions of redundancy are no longer enough.

A network built on redundant fiber may look robust on paper, but if both routes follow the same right-of-way, a single flood, wildfire, or backhoe cut can take them down together. Like-for-like redundancy cannot protect against shared failure modes.

True resilience requires dissimilar redundancy. That means pairing different transmission mediums so the failure mode of one is covered by the strength of another:

An RF backhaul link that is susceptible to rain fade can be paired with a backup FSO path that is less affected by that specific condition, or vice versa.

A terrestrial path can be backed by a high-reliability satellite failover that keeps critical services online even when ground infrastructure is disrupted.

This multi-layered defense elevates connectivity from a utility to an enabling platform for the future. It is the difference between “usually on” and “designed to stay on” when the environment, demand profile, or threat landscape shifts suddenly.

The Leadership Challenge

The technology to close the connectivity gap already exists, and the economics can work. The harder part is the mindset shift. Vendors must see themselves as collaborators first, working together to grow the overall market, and competitors second. Regulators must evolve funding models from technology mandates to outcome-based targets. Operators must be willing to move beyond fiber-first orthodoxy and design converged networks from day one.

The question is no longer whether converged, hybrid networks will dominate. The question is which organizations will lead this transformation, building the resilient, five-nines infrastructure that the AI future will depend on.

Connected data nodes with central processing hub

OCP 2025: Arm’s Chiplet Play Aims To Democratize AI Compute

During the recent OCP Summit in San Jose, Jeniece Wnorowski and I sat down with Eddie Ramirez, vice president of marketing at Arm, to unpack how the AI infrastructure ecosystem is evolving—from storage that computes to chiplets that finally speak a common language—and why that matters for anyone trying to stand up AI capacity without a hyperscaler’s deep pockets.

Two years ago at OCP Global, Arm introduced Arm Total Design—an ecosystem dedicated to making custom silicon development more accessible and collaborative. Fast-forward to this year’s conference, and the program has tripled in participants, with partners showing real products both in Arm’s booth and in the OCP Marketplace. That traction sets the backdrop for Arm’s bigger news: an elevated role on OCP’s Board of Directors and the contribution of its Foundational Chiplet System Architecture (FCSA) specification to the community.

Why should operators, builders, and CTOs care? Because the cost and complexity of building AI-tuned silicon is still brutal. Depending on the packaging approach—think advanced 3D stacks—Eddie put the total bill near a billion dollars. That number alone has kept bespoke designs out of reach for all but a few. The chiplet vision changes the calculus: assemble best-of-breed dies from different vendors rather than funding a monolith. But the promise only holds if those chiplets interoperate cleanly across more than just a physical link.

That’s the gap FCSA endeavors to fill. It goes beyond lane counts and bump maps to define how chiplets discover each other, boot together, secure the system, and manage the data flows between dies. If it works as intended inside OCP, we are an inch closer to a real chiplet marketplace—mix-and-match components with predictable integration, not months of bespoke glue logic.

Ecosystem is the keyword here, and not just for compute. Eddie spoke to collaborations across the platform, including within storage, as a case in point. Storage is stepping into the AI critical path, not simply holding training corpora but participating in the performance equation. AI at scale turns every subsystem into a performance domain. If data can be prepped, staged, filtered, or lightly processed closer to where it lives, you free up precious GPU cycles and avoid starving accelerators. Expect to see more of that thinking show up across NICs, DPUs, and smart memory tiers.

There’s also a geographic angle that’s difficult to ignore. Several of the newest Arm Total Design partners hail from Korea, Taiwan, and other regions actively cultivating their own semiconductor ecosystems. That matters for resilience and supply, but also for innovation velocity. When the entry ticket to custom silicon comes down, you get more specialized parts serving narrower, high-value slices of AI workloads—think tokenizer offload, retrieval augmentation helpers, or secure inference enclaves woven into the package fabric.

Underneath the product updates is a posture shift: lead with others. The Arm Total Design ecosystem is designed for co-design, not solo heroics, acknowledging that no one player can keep up with AI’s pace alone. OCP, with its bias toward open specs and reference designs that ship, is a natural forcing function. Putting FCSA into that process doesn’t just rack up community points; it pressures the spec to survive real-world scrutiny—power budgets, thermals, board constraints, and the ugly details that tend to eat elegant diagrams for breakfast.

If you’re operating AI clusters today, you’re already feeling the ripple effects. Racks are transitioning from steady-state power draw to spiky, sub-second pulses. Data movement is the enemy. The “box-first” era is fading into a rack- and campus-first design ethic where each layer—power delivery, cooling, storage, fabric, memory, compute—must flex in concert. Chiplets slot into that future because they can accelerate specialization at the silicon layer while OCP standardization tames integration higher up the stack.

What should you watch next? Three signals. First, real FCSA-based silicon or reference platforms that demonstrate multi-vendor die assemblies with clean boot and security flows. Second, storage and memory vendors showing measurable end-to-end gains on AI pipelines when compute nudges closer to data. Third, OCP Marketplace listings that move from reference intent to deployable inventory you can actually procure for pilot workloads.

If the last two years were about proving that chiplets are technically feasible, the next two will test whether they’re operationally adoptable. Specs are necessary; supply chains and service models are decisive. The teams that align those pieces—across vendors, geographies, and disciplines—will dictate how fast AI capacity gets cheaper, denser, and more power-aware.

TechArena Take

The AI build-out is colliding with real-world constraints—power, thermals, and capital. Ecosystems that compress time-to-specialization without exploding integration cost will win. Arm’s OCP board seat plus the FCSA contribution is a smart bet that interoperability is the bottleneck to unlock. If FCSA becomes the lingua franca for chiplets, operators could see a practical path to tailored silicon without a billion-dollar entry fee. Pair that with smarter storage and memory paths, and you start to chip away at the two killers of AI efficiency: idle accelerators and stranded data. The homework now is ruthless validation: put these pieces under AI-class loads, measure tokens per joule, and prove that “lead with others” doesn’t just sound good on stage—it pencils out in the data center.

Robotic arm creating sparks inside factory

Physical AI in Production: Datara AI’s Data-Loop Edge Playbook

The next wave of AI won’t live in a data center—it will weld seams, pick bins, and navigate factories alongside people. Physical AI brings intelligence into machines that perceive, decide, and act at the edge, closing the loop between perception and action.

Real-world factories introduce drift, glare, vibration, dust, and unpredictable human behavior—conditions that most models never see in simulation.

For IT and cloud architects, this is a stack problem with hard requirements: real-time inference under adverse conditions, data pipelines that span OT and IT, and operational discipline that turns pilot demos into consistent, reliable uptime. The gap between ‘works in simulation’ and ‘works every day on the line’ remains the main blocker to scale.

But it’s also a workforce problem. Robots reduce injury risk in hazardous tasks, but displacement effects are real and localized. The question isn’t “robots: yes or no?” but “how do we deploy them responsibly with both operational rigor and a workforce plan?”

Why the Market is Ready 

Three curves are converging: cheaper edge compute and sensors, strong perception models, and maturing MLOps for robotics. The International Federation of Robotics reports 542,000 industrial robots installed in 2024—the fourth consecutive year above 500k—with global demand doubling over the past decade. Industrial AI spending is projected to grow from $44 billion in 2024 to $154 billion by 2030. 

Standards are accelerating deployment. ROS 2/DDS, OPC Unified Architecture, and OCP’s rack-level guidance are pushing interoperability across sensors, controllers, and training infrastructure. The blockers are shifting from feasibility to integration discipline and change management - not ‘Can we automate?’ but ‘Can we maintain reliability when conditions vary beyond 5–10% of training data?’

Industry-grade platforms now stitch together simulation, data pipelines, and robotics foundation models. NVIDIA’s Omniverse plus Isaac tools let teams generate synthetic data, train policies in digital twins, and validate behaviors before touching a live cell—shrinking iteration from months to days. The missing piece is capturing the tribal knowledge of veteran technicians and encoding it into recovery behaviors robots can execute.

The Architecture That’s Working

I spoke with Durgesh Srivastava, CTO of DataraAI, at the recent OCP Global Summit about what separates production systems from pilot theater. His outlook is pragmatic: target full automation for bounded task families, match human quality, and build graceful fallbacks when reality goes off-script. 

DataraAI provides a data engine for physical AI—a data-as-a-service platform that transforms factory experience into machine intelligence. It captures how technicians act, how robots fail, and how edge conditions drift, creating the data foundation real-world robotics has always lacked.

The company emphasizes three pillars: 

1. Egocentric Multi-Modal Data Capture – Robots learn from the same viewpoint they act in. DataraAI’s robot-mounted and wearable sensors record RGB-D vision, IMU, tactile, and audio data from real operations. This captures the nuanced cues—force patterns, drift, micro-failures—that static cameras miss.

2. AI-Driven Annotation – DataraAI’s engine automatically labels rare and high-impact events—fires, spills, breakdowns, human-robot handoffs—turning chaos into structured data. It consistently captures scenarios that traditional CV pipelines fail to label or detect.

3. Continuous Learning Loop – New anomalies are fed back into the data engine. Each cycle makes models more resilient and accurate in the field. Every exception becomes new training data, creating a self-improving loop tied directly to real operations.

Early industrial pilots using this loop showed a 53% accuracy lift and 67% better edge-case handling—clear evidence that real-world data closes the performance gap.

The winning pattern is consistent: push perception and control onto the robot, treat the cloud as training and update infrastructure, and run a disciplined data loop that captures real-world anomalies. In harsh conditions—glare, occlusion, spark bursts—edge models must keep the task running. Inference runs locally on the robot, keeping factory data on-site, reducing latency, and enabling real-time adaptation to drift and anomalies. The back end aggregates field data and pushes lightweight updates routinely. That loop turns demos into dependable production.

High-fidelity simulation generates diverse synthetic data and lets teams rehearse rare events safely. Foundation models provide generalizable priors; reinforcement learning in sim refines task skills before transfer to the real world. Edge inference runs locally; telemetry goes upstream for labeling and augmentation; new policies return during maintenance windows. Daily updates are feasible when you structure the pipeline—this reduces drift and grows edge-case coverage.

The Workforce Reality 

Case studies show robots cut musculoskeletal risk and reduce exposure to hazardous tasks like forging and welding. Collaborative-robot safety standards (ISO 10218, ISO/TS 15066) and OSHA guidance formalize safe human-robot interaction.

But displacement effects are real. MIT research found that each additional industrial robot per thousand workers reduced employment and wages in affected commuting zones—a meaningful impact not fully offset by productivity gains. The IMF estimates about 40 percent of jobs worldwide are exposed to AI impact; roughly half may see productivity augmentation, while the rest face reduced labor demand without intervention.

The macro takeaway: adoption will rise, safety can improve, and some roles will shift. For architects presenting to leadership or  works councils, the deployment plan must address both.

TechArena Take 

Physical AI is crossing from pilot theater to production credibility.

The durable advantage comes from operational muscle: how quickly you spin the loop from floor data to better policies and back to the line, while moving people from hazardous repetitive work to higher-value tasks with clear retraining paths. Start with one cell, one exception, and a fallback procedure. Prove you can turn tribal knowledge into machine-executable behaviors and safety risks into measurable improvements. Then scale the loop. That’s the compounding edge in physical AI.

Inside the New Rules of Responsible AI Governance

AI is no longer just powering apps; it is determining credit, authorizing vendors, and deciding who to grant access to critical services.

It has moved from research centers to the heart of our financial institutions, healthcare systems, online retail sites, and even government agencies. But with such rapid proliferation, there is fierce scrutiny. The question being asked in boardrooms, policy circles, and living rooms is simple: How do we make AI fair, transparent, and accountable?

This is where Responsible AI governance becomes imperative. Responsible AI is ultimately about trust-building, creating systems that are safe, ethical, and respectful of human values. It’s about putting guardrails in the design, development, and deployment of AI to ensure a balance between risks and innovation.

Above all, Responsible AI is not something that can be managed within the confines of a single company. It extends to the whole ecosystem of users, regulators, and partners. Whether it’s banks complying with global anti-money laundering rules, or e-commerce platforms authenticating sellers without bias, governance involves cooperation and shared standards.

And though experts refer to it in various ways, “trustworthy AI,” “ethical AI,” or “principled AI,” the goal is the same: maximizing the value AI generates while minimizing the risks. That includes making sure systems continue to be reliable throughout their lifecycle, eliminating bias, securing data, and ensuring decision-making can be explained.

Defining Responsible AI Governance

The answer to the question of “how do we make AI fair, transparent, and accountable?” lies in Responsible AI governance, a set of principles, policies, and practices that guide how AI is developed, deployed, and governed.

While no single definition exists yet, governments, researchers, and businesses are all at least united on this: responsible AI is building trust. Different frameworks place emphasis on different aspects. For example, the European Union's High-Level Expert Group on AI refers to AI as lawful, ethical, and resilient. Singapore's guidelines place a focus on transparency, fairness, and human-centric design. And big tech has emerged with its own approaches, requiring explainability, accountability, and safeguarding against bias.

Simply stated, “responsible” can mean very different things based on who you talk to. But the shared purpose is clear; AI should work for people, not against them. It needs to augment human choice and protect individual rights and societal values.

Principles of Responsible AI

Across numerous frameworks, a shared set of principles has come to the fore. They are not philosophical constructs; they are practical standards that all organizations ought to remember while applying AI:

Robustness and Safety – The systems must be resilient against errors, adversarial attacks, or misuse. In practice, it means stress-testing AI models, watching out for drift, and building contingency plans.

Inclusivity and Fairness – AI should not amplify human bias. Banks, for instance, must ensure certain credit models do not unfairly reject certain groups. Online shopping sites must prevent recommendation engines from amplifying discrimination.

Privacy and Security – Data is the lifeblood of AI but mismanaging it undermines trust. Robust data governance, transparent documentation, and adherence to regulations such as GDPR are now minimum expectations.

Explainability and Transparency – If users can’t see how a model is making decisions, they won’t trust it. Explainable AI (XAI) tools are becoming essential for compliance and customer trust.

Accountability and Governance – Humans need to be in charge ultimately. That equates to clear oversight structures, audit trails, and escalation paths when AI is incorrect.

By integrating these principles into operations and strategy, organizations achieve a balance between innovation and protection. Done correctly, Responsible AI is a source of competitive advantage rather than a compliance exercise.

The Policy Landscape: U.S. and Europe

Governments are not sitting on the sidelines watching AI progress; they’re making the rules of the game.

United States: The White House introduced the Blueprint for an AI Bill of Rights in 2022, outlining five principles to which all AI systems must adhere: safety, non-discrimination, data privacy, transparency, and the right to a human alternative. The National Institute of Standards and Technology (NIST) thereafter published its AI Risk Management Framework (2023), which, while voluntary, has become the de facto business playbook for those wanting to prove their AI is trustworthy.

At the state level, momentum is also gaining. Colorado passed the country’s first state-wide AI law in 2024 that requires companies to assess and minimize algorithmic bias in high stakes uses such as employee recruitment and credit.

Europe: The European Union took it further with the AI Act, implemented in August 2024. It is the first legally binding law of this kind anywhere and adopts a risk-based approach.

The financial industry illustrates the stakes. AI already dominates fraud detection, credit scores, risk management, and robo-advisory services. While these technologies bring efficiency and inclusivity, regulators want them to also be explainable, fair, and secure. Under the AI Act, even general-purpose AI systems such as generative systems fall under transparency obligations such as labeling AI-created content or flagging deepfake content.

Enforcement is not an afterthought. Fines of up to €30 million or 6% of global revenue have been set by the EU for wrongdoing. In the United States, regulators such as the FTC and CFPB are increasingly framing biased or deceptive AI systems for consumer protection violations, suggesting that more stringent enforcement is in the pipeline.

Why Policymakers Care

For governments, Responsible AI governance is much more than compliance. It is a competitiveness factor, a citizens’ protection factor, and a matter of establishing trust globally. Policymakers face the dual challenge of driving innovation while requiring safety provisions to safeguard people.

Consider the banking sector. Banks utilize AI to inform credit decisions, fraud detection, and anti–money laundering (AML) systems. If biased or opaque, they can discriminatorily reject customers, drown compliance teams with false alarms, or even create systemic financial risk. Regulators like FinCEN in the United States and the European Banking Authority in the EU therefore emphasize explainability and fairness in AI-based AML systems.

The e-commerce sites themselves are not immune to similar risks. AI powers seller sign-up, product suggestion, and content moderation. Without regulation, the same technologies can facilitate fraud, permit misrepresentation, or result in biased conclusions for sellers and buyers. The consequences are trust erosion and risk of regulatory fines.

The Path Forward

Responsible AI governance is not a bucket list; it is a collective sense of responsibility. For organizations, it is about embedding AI principles into customer experience, compliance infrastructure, and corporate brand. For policymakers, it is about creating guardrails that are enforceable but support innovation. For technologists and researchers, it is about creating tools of explainability, resilience, and fairness.

If done effectively, governance contributes trust and creates enduring value. If neglected, threats, discrimination, misinformation, and systemic flaws can overshadow rewards.

Responsible AI is ultimately the cornerstone of the long-term future of technology. For policymakers, it saves rights. For companies, it protects reputation and maintains compliance. For society, it ensures that technology supports human values.

Efficiency: The New Moat for Data and AI Teams

For years, the tech world equated innovation with scale: more data, more models, more compute. But 2025 has revealed a different truth. Scale alone is no longer a differentiator. The most forward-thinking data and AI teams are still innovating, but they are doing it by designing for efficiency—building smarter, not just bigger.

Across industries, leaders are realizing that intelligent design—not brute force—drives lasting progress. As cloud budgets rise and sustainability becomes a board-level priority, the smartest teams are treating efficiency as strategy, not just cost-cutting. According to Gartner’s Top Trends in Data & Analytics for 2025 report, data initiatives are shifting “from the domain of the few to ubiquity”—and leaders now face pressure “not to do more with less, but to do a lot more with a lot more.”

McKinsey, in its Seizing the Agentic AI Advantage report, finds that companies succeeding with AI are the ones that optimize every layer of their technology stack for speed and cost.

Efficiency Is the Real Moat in AI and Data Engineering

The edge in AI no longer lies in model complexity; it lies in how well teams orchestrate their resources. A group that can run the same workload faster or cheaper instantly earns more room to innovate.

Yet cloud waste remains immense. Organizations lose an estimated 30 % of their cloud budgets — according to Flexera’s State of the Cloud Report — due to idle or misallocated resources. Progressive teams are embedding FinOps dashboards directly into their pipelines, tracking cost, carbon, and performance in real time.

Efficiency has evolved from a side project to a design philosophy. It now helps determine which teams survive budget cuts and which scale with confidence.

Architecture Matters More Than Algorithms

Generative AI put algorithms at the center of attention, but it is architecture that sustains innovation. The strongest data platforms today are modular, event-driven, and self-healing.

Traditional ETL pipelines are being replaced by composable frameworks built on open formats such as Iceberg and Delta Lake. These modern table architectures enable schema evolution, time travel, and cost-efficient versioning. Databricks, in its The Future of the Modern Data Stack webinar, notes that open-standards and flexible architectures are dramatically simplifying enterprise data platforms.

True innovation happens when systems are simple to extend, easy to test, and quick to evolve. Big no longer means better. Adaptable does.

Sustainability Metrics Will Shape the Next Wave of Innovation

As AI workloads grow, energy transparency is becoming inseparable from performance. Cloud providers are now publishing sustainability data alongside billing metrics, allowing engineers to see the environmental impact of every query.

Microsoft’s Cloud for Sustainability platform and Google Cloud’s Carbon Footprint tool, for example, provide visibility into energy use per workload. This turns sustainability from a talking point into a measurable engineering discipline.

By 2026, success will depend not only on how fast teams generate insights but also on how efficiently they convert energy into intelligence. The most forward-thinking innovators will measure their progress in joules as carefully as they do in dollars.

Constraints Drive Creativity

It is a common belief that efficiency stifles experimentation. In practice, it often does the opposite. When teams have to work within limits, they tend to think deeper, design cleaner, and test smarter.

Harvard Business Review’s Why Constraints Are Good for Innovation article shows that when teams embrace constraints, they tend to focus on what truly matters—often generating more original and effective ideas.

In data engineering, those constraints spark leaner algorithms, reusable components, and automation breakthroughs. Efficiency, when embraced thoughtfully, becomes a powerful catalyst that channels innovation instead of constraining it.

Efficiency Is Becoming the Language of Leadership

CFOs, CTOs, and sustainability officers now share a common language built on efficiency. They talk about cost per insight, energy per transaction, and governance per gigabyte. Success is no longer measured only by how much was delivered, but by how responsibly it was achieved.

Leaders who once cared only about uptime now care about utilization curves and carbon intensity. This cultural shift shows that efficiency is no longer an operational concern; it has become a leadership mindset that connects finance, engineering, and sustainability goals.

These trends point to a clear reality: efficiency is no longer a constraint. In practice, efficiency is the price of admission for sustainable innovation at scale.

Conclusion

Efficiency is not the opposite of innovation; it is how leading teams make their innovation durable and scalable.

As the excitement around massive AI models begins to settle, the real winners will be the teams that engineer with discipline, measure with integrity, and optimize with purpose. The future belongs to those who understand that every dataset, every compute cycle, and every design choice carries a cost.

True innovation means creating maximum impact with minimal waste.

How is your organization redefining innovation through efficiency?

Glass shield sitting on top on CPU's in a computer

Resilience in the Age of AI: Inside Commvault’s ResOps Pitch

If you listen to enough AI keynotes, you start to hear similar refrains: AI is transformative, the pace is unprecedented, and security hasn’t kept up. What was different at Commvault’s SHIFT event was less the diagnosis and more the operating model they’ve put around it: ResOps and Unity.

Commvault’s leadership argued that cyber resilience needs a new name, a new architecture, and a promotion in the enterprise hierarchy. They call their answer “ResOps”—resilience operations—and they introduced Commvault Cloud Unity, a unified platform that embodies that ResOps model across security, identity, and recovery.

You don’t have to buy into the branding to see the signal: resilience is being pulled out of the back office and moved to the center of how AI-era infrastructure is designed and run.

From Cyber Resilience to AI Resilience

Two years ago, Commvault elevated data protection into a more strategic posture they call “cyber resilience,” emphasizing that data protection is more than a last-line-of-defense tape in a vault. At SHIFT, CEO Sanjay Mirchandani pushed that idea further: in an AI-first world, resilience isn’t just about systems and data anymore; it’s about how thousands of autonomous agents interact with those systems and data in real time.

The framing is straightforward:

Generative and agentic AI are driving an explosion in data volume, variety, and sensitivity.
AI systems are increasingly being used by attackers as well as defenders.
Identities are proliferating, with non-human and machine identities outnumbering humans.

In that context, Mirchandani argued that “AI resilience” requires three things to move in lockstep: security, identity, and recovery. If any one of the three lags, AI becomes a new fragility multiplier instead of a growth engine.

Many large enterprises are already living this reality: fragmented data estates, software as a service (SaaS) and cloud-native sprawl, and a rising tide of identity-driven attacks. SHIFT’s contribution is to put a more opinionated operating model around those forces and to insist that resilience needs its own closed loop.

ResOps: Category Creation or Real Operational Shift?

ResOps, as Commvault describes it, is a continuous loop across three stages:

Understand and govern data and identities (who or what is accessing what, and under which policies).
Detect anomalies and threats in near-real time, across both identities and data.
Recover cleanly and predictably at scale, with as little data loss as possible.

On paper, that sounds familiar. Security teams talk about “detect, respond, recover” all the time. What Commvault is doing is pulling data protection and identity recovery into that motion as first-class citizens, rather than something the security team hands off to infrastructure after the incident is contained.

ResOps is less about inventing a new discipline and more about admitting that the old silos are breaking down.

In many organizations today:

SecOps, identity teams, and backup teams all operate with different tools, policies, and metrics.
Automation exists, but it’s often local to a domain: SOAR (security orchestration, automation, and response) playbooks, runbooks in ITSM (IT service management) tools, scripting in backup platforms.
AI is introduced at the edges—a chatbot here, anomaly detection there—without a single control plane that understands resilience end-to-end.

What Commvault is really arguing for is convergence: one fabric that connects identity posture, data governance, threat signals, and recovery orchestration. Whether you call that ResOps or just “finally connecting the dots” is semantics, but the direction of travel is clear across the industry.

Identity Becomes the New Perimeter

One of the more grounded sections of the SHIFT program focused on identity resilience. The thesis: if identity is the new perimeter, then identity recovery and forensics have to be just as mature as server and storage recovery.

A few key points stood out:

Most breaches still start with stolen or misused credentials, according to industry data like CrowdStrike’s ~80% figure cited on stage.
Machine identities and service accounts are rapidly becoming one of the dominant attack vectors, especially as automation proliferates.
Traditional identity recovery is either too coarse (all-or-nothing forest restore) or too manual for crisis conditions.

Commvault’s answer is a set of capabilities around Active Directory and Entra ID that continuously audit changes, flag risky privilege drift, and allow rollbacks of specific changes or entire “attack chains.” In their demo, a compromised service account quietly spreads a malicious group policy; the platform detects the pattern, allows an operator to unwind the changes, and then feeds that insight back into a vulnerability view.

It’s interesting that identity recovery and identity analytics are now being positioned as central pillars of resilience, not niche features. As AI agents increasingly act on behalf of users and services, the blast radius of a compromised identity gets bigger. The ability to unwind that blast radius precisely—without flattening an entire domain—will matter more than it has in the past.

Clean Recovery as a First-Class Outcome

Another recurring theme in the keynote was the “billion-dollar question”: when you recover, how do you know the data is both clean and current?

Traditionally, recovery teams have had to choose:

Roll back further in time to ensure a clean copy, and accept more data loss.
Stay as close to the event as possible and risk reintroducing malware.

Commvault’s proposed answer is an approach they call synthetic recovery, paired with threat scanning and cleanroom testing. Conceptually, it works like this:

Scan backups with multiple signals (anomalies, encryption patterns, malware signatures, and external threat intel).
Use that understanding to automatically assemble a composite recovery point that pulls in only the last known-good versions of corrupted files.
Rebuild systems into an isolated “cleanroom” environment using golden images, then reattach the cleaned data for validation before going back to production.

Embedded in this approach is an important shift: recovery is no longer just about hitting a recovery point objective/recovery time objective (RPO/RTO) number. The new bar is “provably clean” plus “minimally lossy,” with a testable chain of evidence you can show to a CISO, a regulator, or your own board.

That’s a much harder problem than it sounds, and vendors across this space are still evolving their answers. But the directional signal is right. As AI accelerates both attack automation and business reliance on data, the cost of a “dirty” recovery—one that quietly reintroduces the threat—gets higher every year.

Cloud, On-Prem, and the Unity Story

Unity, as positioned at SHIFT, is Commvault’s attempt to bind together three worlds under one control plane:

SaaS workloads (M365, Google Workspace, Salesforce, DevOps platforms, and more).
Cloud-native stacks across AWS, Azure, and Google Cloud.
On-prem environments using their Hyperscale appliances and reference architectures.

Again, the specifics are vendor-branded, but the pattern is market-wide. Enterprises don’t live in one world anymore. A single business process might touch Kubernetes, SaaS customer relationship management (CRM), cloud databases, edge stores, and an on-prem analytics farm. Resilience that stops where a hyperscaler’s responsibility ends is no longer enough.

The architectural bet we’re seeing is:

Separate control planes from data planes.
Scale protection and recovery elastically in the cloud, while allowing enterprises to bring their own storage and appliances where they need to.
Wrap everything in a common policy model and observability layer so you can reason about posture across environments, not per tool.

Unity is one version of that story. Other vendors are building their own versions.

The TechArena Take

If we zoom out from the SHIFT announcements and marketing language, a few broader trends come into focus:

Resilience is becoming an operating model, not a product line: Boards and CEOs are now asking “how fast can we recover from the inevitable?” in the same breath as “what is our AI strategy?” That pushes resilience into day-to-day operations and out of the “insurance” category.
Identity and data are converging in resilience conversations: The old model treated identity as identity access management’s (IAM’s) problem and backup as an infrastructure problem. AI collapses that separation. In an agentic world, identity mistakes and data mistakes are tightly coupled.
“Clean” is the new RPO: Recovery objectives are no longer just about how much data you lose or how fast you come back. They are about how certain you are that you haven’t re-imported the adversary into production.
AI is both the accelerant and the tool: The same AI that makes it easier to discover vulnerabilities and automate attacks is also being harnessed to correlate signals, propose recovery points, and orchestrate complex workflows. The arms race is well underway.
SHIFT doesn’t change the fundamentals: Enterprises still need clear ownership across SecOps, identity, and infrastructure. They still need to rationalize tool sprawl and understand where each platform begins and ends. And they need to test recovery assumptions in realistic scenarios, not just on paper.

What SHIFT underlines is that resilience is now part of the AI conversation, not an afterthought. As enterprises experiment with AI factories, agentic systems, and data-native product development, the resilience stack underneath is being reimagined just as aggressively as the AI stack on top.

In the arena, that’s the story to watch: not which platform has the most features this quarter, but which operating models help enterprises withstand—and learn from—the inevitable failures that come with AI at scale.

DNA made out of glowing fiber optic strands

Inside the New Product DNA: AI, Causality, Experimentation

For decades, intuition, gut feel, and post-launch hindsight have driven product development. Teams brainstormed, launched features, and hoped for the best. Success stories were celebrated, and failures were dismissed as bad timing.

Today, that’s no longer the case. The new DNA of the product is intrinsically data-native and experiment-driven, where AI meets causal inference meets automation.

In this new paradigm, learning is the product itself. Every release, click, and interaction feeds a continuous feedback loop that helps products evolve faster and more intelligently.

Evolution of Product Development

Each phase has shortened the learning cycle—from slow post-mortem reviews to real-time decision-making. AI now serves as the connective tissue, turning every user signal into actionable insight.

How AI is Rewriting Product Development

AI’s impact is not just in automation, it’s in amplifying human experimentation capacity. Here’s how it’s transforming each layer of the product lifecycle:

1. Ideation → Generative Exploration

LLMs and copilots are helping teams explore what to build faster than ever. Prompt a model with “How might we reduce checkout friction for Gen Z users?” and it can instantly generate hypotheses, user flows, and even A/B test copy variants.

2. Design → Data-Infused Creativity

Design tools powered by AI (like Figma’s AI assistant or Uizard) now simulate user reactions, predict engagement heatmaps, and propose design alternatives based on prior experiment data.

3. Build → Experiment-Ready Engineering

Modern engineering frameworks integrate feature flags, metric tracking, and causal validation directly into the codebase. This allows for safe experimentation at scale, every rollout is testable by design.

4. Launch → Causal Attribution

Instead of asking “Did this feature correlate with higher conversion?” teams now ask “Did it cause it?” Causal inference frameworks (Propensity Matching, Difference-in-Differences, Meta-learners) help isolate true impact from noise.

5. Learn → Automated Knowledge Loops

Agentic AI systems summarize experiment learnings, identify patterns across experiments, and suggest next actions, forming a self-improving experimentation ecosystem.

The New Product DNA: Core Components

Modern product organizations are evolving from static roadmaps to adaptive, learning systems. At the heart of this transformation lies a new architecture (highlighted below) - a Product DNA where AI, data, and experimentation form the building blocks of continuous innovation.

Real-World Example: The AI Experimentation Loop in a Digital Health Platform

Consider a digital health platform for the management of chronic conditions such as diabetes and hypertension. The goal of this product would be to improve daily engagement through encouraging users to record their vitals, adhere to care plans, and take medication as directed. Rather than setting fixed reminders, the product team will create an AI-powered experimentation loop that continuously learns the user's behavior and gradually fine-tunes its interventions in real time.

It begins with a Data Backbone: every interaction is captured—glucose logs, step counts, coach messages, and reminders-unified into one secure telemetry system. This causal-ready foundation connects wearable sensor data, app behavior, and contextual signals like time of day and mood, allowing for cause and effect to be measured with precision.

Through the identification of such patterns, the AI Engine applies LLMs and predictive models to formulate hypotheses. It may identify that users logging meals less than 10 minutes after eating are showing a much higher degree of adherence, triggering new tests of customized notifications or even empathetic message tones for users who display fatigue.

Each of these ideas moves into the Experimentation Layer, where different behavioral nudges are compared in controlled A/B or adaptive tests. For instance, one group receives fixed daily reminders, while another group gets adaptive prompts triggered from sensor data. Effectiveness is determined by metrics such as adherence, the number of app openings, and glucose stability, each automatically favoring better variants through bandit algorithms.

The Decision Orchestrator then summarizes results-for example, "adaptive reminders improved adherence by 8% among evening users" - and schedules the next round of tests. Finally, insights feed into Feedback Memory, a long-term intelligence system that stores metadata on what worked, for whom, and why.

Over time, the platform becomes a self-learning health ecosystem where every interaction reinforces its knowledge of user behavior. The result is more than greater engagement; it's a living, data-driven product that continuously tailors care and fuels innovation.

Next in the Series

Banani’s Next Article: “Rewiring Product Management with Generative AI: From Roadmaps to Deployment.” She’ll explore how generative AI is reshaping product strategy from idea generation to roadmap alignment and real-time user feedback loops.

Nebius at SC25: Building the Neocloud for Enterprise AI

From SC25 in St. Louis, Nebius shares how its neocloud, Token Factory PaaS, and supercomputer-class infrastructure are reshaping AI workloads, enterprise adoption, and efficiency at hyperscale.

Inside a data center with endless amounts of server racks

Inside SC25: AI Factories From Racks to Qubits

AI didn’t just show up at Supercomputing 2025 (SC’25) in St. Louis—it took over the agenda. From exabyte-scale storage and 800 Gbps fabrics to liquid-cooled racks and emerging quantum accelerators, SC25 made it clear that the next era of HPC is really about building AI factories end to end.

Below is a structured look at the announcements the TechArena team is tracking, organized around the major layers of the stack.

1. Data and Memory Platforms for Agentic AI

The most urgent theme on the show floor: getting more useful work out of every GPU. That starts with memory and data.

WEKA: Breaking the GPU Memory Wall and Storage Economics

WEKA formally took its Augmented Memory Grid from concept to commercial availability on NeuralMesh, validated on Oracle Cloud Infrastructure (OCI) and other AI clouds. The goal is to extend GPU key-value cache capacity from gigabytes into the petabyte range by streaming KV cache between HBM and flash over RDMA using NVIDIA Magnum IO GPUDirect Storage.

The reported gains are significant: 1000x more KV cache capacity, up to 20x faster time-to-first-token at 128k tokens versus recomputing prefill, and multi-million IOPS performance at cluster scale. For long-context LLMs and agentic AI workflows, that means fewer evictions, less recompute, and better tenant density per GPU — directly attacking inference cost structures on OCI and other platforms.

On the hardware side, WEKA’s next-gen WEKApod appliances push the economics further. WEKApod Prime uses “AlloyFlash” mixed-flash configurations to deliver 65% better price performance while preserving full-speed writes, and WEKApod Nitro focuses on performance density with 800 Gb/s networking via NVIDIA ConnectX-8 SuperNICs. Together, they target AI factories that need high GPU utilization, high density, and lower power per terabyte.

VAST Data + Microsoft Azure: AI OS Meets Cloud Scale

VAST Data is extending its AI Operating System into Microsoft Azure. VAST AI OS will run on Azure’s Laos VM Series with Azure Boost, giving customers a unified “DataSpace” global namespace so they can move between on-prem and Azure without refactoring data pipelines.

InsightEngine and AgentEngine let customers run vector search, RAG pipelines, and agent workflows directly where the data lives, and the underlying disaggregated, shared-everything (DASE) design allows independent scaling of compute and storage. The combined effect is a cloud-native AI operating system tuned for agentic AI pipelines, built to keep Azure’s GPU and CPU fleets saturated.

MinIO ExaPOD: Exabyte as a Design Point, Not an Edge Case

MinIO’s ExaPOD reference architecture plants a big flag for exascale AI data. It’s a 1 EiB usable building block (about 36 PiB usable per rack) that scales linearly in performance and capacity. In the reference design, ExaPOD delivers on the order of 19.2 TB/s aggregate throughput at 1 EiB with 122.88 TB drives, around 900 W of power per PiB including cooling, and modeled all-in economics in the $4.55–$4.60/TiB-month range at exabyte scale.

Built on Supermicro servers, Intel Xeon 6781P, and Solidigm D5-P5336 NVMe, ExaPOD is clearly aimed at hyperscalers, neoclouds, and large enterprises that see exabytes as the new baseline for LLMops, simulations, and observability.

2. Power, Cooling, and the Introduction of PCE

As AI deployments creep toward gigawatt footprints, power and cooling have shifted from “facility detail” to board-level design constraint.

Airsys PowerOne and Aegis: Cooling as a Compute Multiplier

Airsys introduced PowerOne, a modular, multi-medium cooling architecture that scales from 1 MW edge sites to 100+ MW hyperscale data centers. It’s tailored for AI and HPC density with a standard cooling stack (CritiCool-X chiller, FluidCool-X CDU, MaxAir fan wall, Optima2C CRAH) and a LiquidRack spray-cooling architecture that can operate in compressor-less modes with dry coolers where climate allows.

Beyond traditional PUE, Airsys is pushing Power Compute Effectiveness (PCE)—a metric that measures how much provisioned power turns into usable compute. The message is that cooling should unlock stranded power and convert it into AI capacity, not just shave a few basis points off energy overhead.

In parallel, Aegis, an affiliated liquid-cooling arm, is being positioned as an agile R&D hub building two-phase CDUs, cold plates, and control systems using rapid 3D manufacturing to keep pace with AI thermal demands.

Schneider Electric and Motivair: Integrated Power + Liquid Cooling

Schneider Electric is leaning into its acquisition of Motivair, blending global power and infrastructure capabilities with more than 15 years of exascale and accelerated-computing cooling experience. The combined portfolio spans chip-level cold plates, rear-door heat exchangers, CDUs, and facility-level power and control systems.

The through-line is that liquid cooling is now being evaluated as part of a full-stack design conversation with power and infrastructure, especially for hyperscale, co-locators, and high-density AI factories where 100 kW-plus racks are quickly becoming normal.

Iceotope KUL BOX: Liquid-Cooled AI at the Noisy, Messy Edge

Iceotope’s KUL BOX brings the AI factory cooling story out of the core data center and into edge environments that were never designed for dense clusters. It’s a compact, liquid-cooled AI inferencing cluster built as a turn-key system: a 24U rack with six Iceotope KUL AI chassis, up to 24 NVIDIA GPUs, top-of-rack switching, and a fully integrated liquid-cooling loop.

The key twist is deployment model. KUL BOX captures almost all of the system’s heat using Iceotope’s precision immersion cooling and rejects it through a separate liquid-to-air outdoor cooler—meaning it can be installed in locations without existing facility water, dry chillers, or traditional white-space infrastructure.

Iceotope highlights several benefits for edge AI and HPC workloads: consistent GPU throughput and reliability from stable thermals, lower energy and cooling overheads, quiet, fanless operation, and a single-vendor solution that bundles rack assembly, fluids, pipework, logistics, on-site installation, and a three-year service plan. Target use cases include telcos and colocation providers, labs running sensitive compute-heavy tasks, and industrial edge deployments with unusual constraints or sustainability requirements.

3. Server and Computing Hardware

On the compute side, vendors largely converged on the same message: more FLOPS per rack, more memory per GPU, and more network bandwidth behind every accelerator.

Dell Technologies: AI Factory Building Blocks

Dell made its AMD Instinct-powered PowerEdge XE9785 and XE9785L servers generally available and introduced the new Intel-powered PowerEdge R770AP. All three are tuned for demanding AI and HPC workloads as part of the Dell AI Factory with NVIDIA.

On the network side, Dell’s new PowerSwitch Z9964F-ON and Z9964FL-ON switches deliver 102.4 Tb/s of switching capacity, targeting dense AI fabrics. Dell also announced integration of ObjectScale and PowerScale storage systems with NVIDIA’s NIXL library, tightening the connection between storage services and GPU-centered inference stacks.

Supermicro, ASUS, Compal, EnGenius: Dense GPU Nodes and Liquid Cooling

Several OEMs showcased how fast they can pack accelerators into standard racks:

Supermicro highlighted Data Center Building Block Solutions featuring NVIDIA GB300 NVL72 systems with 72 Blackwell Ultra GPUs and liquid cooling up to 200 kW per rack. It also launched a 10U air-cooled AMD Instinct MI355X server that claims up to 4x compute and 35x inference performance versus its predecessor.

ASUS unveiled its XA AM3A-E13 server with eight AMD Instinct MI355X GPUs and dual AMD EPYC 9005 CPUs, offering 288 GB of HBM and up to 8 TB/s of memory bandwidth in a modular 10U chassis. The platform complements ASUS’ broader AI infrastructure portfolio, including NVIDIA GB300-based systems.

Compal brought high-density, liquid-cooled SG720-2A/OG720-2A servers supporting up to eight AMD Instinct MI325X GPUs with forward compatibility for MI355X, plus the SG223-2A-I immersion-cooled system that supports up to eight PCIe GPUs in a 2U chassis.

EnGenius, better known in networking, jumped into the server market with modular Intel Xeon 6-based systems. The flagship 4U EAS5210 can be configured with up to eight Intel Arc Pro B60 accelerators for LLMs and AI training workloads, built on an OCP DC-MHS architecture.

Intel Xeon 6: Keeping CPUs Relevant in AI HPC

Intel used SC25 to emphasize that CPUs still matter in HPC and AI workflows, particularly for simulation, pre/post-processing, and orchestration. The Xeon 6 line targets up to 2.1x faster performance on key HPC workloads like LAMMPS, OpenFOAM, and Ansys Fluent, riding on higher memory bandwidth and built-in AI acceleration.

4. Networking for Gigascale AI: Cornelis and Friends

If storage and cooling are about feeding GPUs and keeping them alive, networking is about making the entire AI factory behave like a single coherent system

Cornelis CN6000 SuperNIC: 800 Gbps, Multi-Protocol, AI-first

Cornelis rolled out its CN6000 SuperNIC, an 800 Gbps adapter that brings its Omni-Path architecture into Ethernet for the first time. CN6000 combines ultra-low latency, up to 1.6 billion messages per second, and full 800 Gbps throughput in a single device.

A key design point is “limitless” RoCEv2 scalability. Traditional RoCEv2 fabrics struggle at scale because managing queue pairs becomes memory-heavy and brittle. Cornelis tackles that with lightweight QPs and a hardware-accelerated RoCEv2 In-Flight table that can track millions of concurrent operations while maintaining predictable latency. The CN6000 is fully compliant with Ultra Ethernet and RoCEv2, positioning it as a standards-based path to 800 Gbps Ethernet fabrics that behave more like purpose-built HPC interconnects.

Cornelis is aligning the CN6000 with next-gen Intel Xeon platforms and working with partners like Intel, AMD, Lenovo, Synopsys, Altair, Atipa, Nor-Tech, Microway, PSSC Labs, and SourceCode to build end-to-end 800G solutions and Omni-Path-based switches and directors.

NVIDIA BlueField-4 and Quantum-X Photonics

On the NVIDIA side, BlueField-4 DPUs continued to show up as a central control plane and offload engine for AI factories. NVIDIA highlighted how storage vendors like DDN, VAST Data, and WEKA are adopting BlueField-4 to push storage services closer to GPUs and eliminate bottlenecks.

NVIDIA also spotlighted Quantum-X Photonics co-packaged optics InfiniBand switches, offering 800 Gb/s per port with significantly better power efficiency than traditional pluggable optics. TACC, Lambda, and CoreWeave are among the operators planning to integrate Quantum-X Photonics into their next-generation systems.

5. HPE, National labs, and Exascale Blueprints

SC25 also reinforced how national labs are shaping the AI/HPC roadmap.

At Oak Ridge National Laboratory, HPE and AMD are partnering on Discovery and Lux—two new systems that blend large-scale simulation with AI training and inference. Lux is positioned as a dedicated AI factory for science and energy, while Discovery focuses on high-bandwidth exascale computing.

At Los Alamos National Laboratory, HPE and NVIDIA are collaborating on Mission and Vision, based on the new HPE Cray GX5000 platform and NVIDIA’s latest CPU/GPU and Quantum-X800 InfiniBand technologies. Mission targets national security workloads; Vision will serve as an unclassified AI and science system and successor to Venado.

For the broader ecosystem, these systems serve as reference architectures for how to co-design CPUs, GPUs, networks, and cooling for converged AI plus simulation workloads.

6. Quantum Computing Edges Closer to Production

Quantum wasn’t the main act at SC25, but it was no longer relegated to the demo corner.

QuEra + Dell: Quantum as Another Accelerator Class

QuEra and Dell demonstrated hybrid quantum-classical workflows where neutral-atom quantum processing units integrate into standard Dell HPC infrastructure via the Dell Quantum Intelligent Orchestrator. The point of the demo: treat quantum as a first-class accelerator alongside CPUs and GPUs instead of a separate science experiment.

Quantum Computing Inc. Neurawave: Photonics for Edge AI

Quantum Computing Inc. (QCi) announced Neurawave, a compact, photonics-based reservoir computing system in a standard PCIe form factor. Operating at room temperature, Neurawave targets edge-AI workloads such as signal processing, time-series forecasting, and pattern recognition, offering fast, energy-efficient processing that complements QCi’s quantum systems.

D-Wave: Annealing as an Energy-Efficient Accelerator

D-Wave highlighted how its Advantage2 annealing quantum computer can tackle combinatorial optimization problems with lower energy use than classical approaches — an angle that resonates as AI and HPC operators watch their power budgets tighten.

7. Other Notable Infrastructure Moves

A few additional announcements round out the “plumbing” for AI factories.

Phison introduced new PCIe Gen5 Pascari X201 and D201 enterprise SSDs, tuned for AI training, hyperscale analytics, and mixed read/write inference workloads. They push Gen5 performance to the edge with high throughput and low latency for data-hungry environments.

Hammerspace showcased its AI solution aligned with the NVIDIA AI Data Platform reference design, providing a unified data foundation for RAG workloads, agentic AI pipelines, and hybrid environments. The goal is to give AI workloads instant access to the right data without re-architecting storage.

TechArena Take: The AI Factory Stack Advances

Stepping back from the logos and part numbers, it’s clear that AI is dominating HPC, driven by policy priorities. SC25 felt like the moment AI factories stepped out of the box and began to rapidly progress.

A few patterns stand out.

First, the bottlenecks have officially moved away from raw FLOPS. The interesting innovation is happening around memory hierarchies, storage fabrics, and KV cache management—exactly the spaces WEKA, VAST, MinIO, and Hammerspace are targeting. The vendors that can prove “more useful tokens per GPU per kilowatt” are going to win the next buying cycle.

Second, power and cooling have been dragged into the AI design conversation whether facilities teams are ready or not. PCE, liquid spray cooling, direct-to-chip loops, 200 kW racks, and now sealed, liquid-cooled edge clusters like Iceotope’s KUL BOX are no longer exotic; they’re becoming prerequisites for deploying Blackwell-scale and inference-heavy clusters wherever the data lives. Cooling is quietly turning into a business lever: whoever can convert the most stranded power into usable compute wins.

Third, the network is being rebuilt around AI assumptions. 800 Gbps Ethernet with Ultra Ethernet, RoCEv2 at scale, CN6000-class SuperNICs, BlueField-4 DPUs, and co-packaged optics all point to the same conclusion: traditional data center Ethernet and “good enough” InfiniBand islands won’t cut it at multi-thousand GPU scale. Deterministic, congestion-free fabrics are table stakes if you want agentic AI to actually run reliably.

Finally, quantum and photonics are edging toward “adjacent accelerators” rather than lab toys. They’re not replacing GPUs any time soon, but they’re already being wired into the same orchestration planes and data fabrics as everything else.

Supercomputing used to be the place to talk about peak FLOPS. In 2025, it quietly turned into the place to advance the entire AI factory—from chip to coolant loop to the edge box bolted to a wall in the field.

‍

Digital shield representing cybersecurity protection

15 Questions Every Cybersecurity Leader Should Ask on Day One

Stepping into a new cybersecurity leadership role can feel like walking into the middle of a story already in progress. The dashboards are glowing, the acronyms are flying, and everyone seems to have a version of what “secure” really means. Before you start making changes or setting new goals, take a breath. The most effective leaders begin not by talking, but by asking.

The right questions will help you uncover what is really happening beneath the surface: how decisions are made, where risks hide, and how people truly feel about security. Here are 15 questions that can guide you through your first few weeks and help you see the whole picture before you start rebuilding.

1. What does “secure” mean here?

Every organization defines security differently. For some, it is about compliance. For others, it is about resilience or customer trust. Start by understanding what your leadership values when they say something is “secure.”

2. What are we protecting, and why does it matter?

You cannot protect everything equally. Get clarity on the company’s crown jewels, what is truly critical to the business and what is not. Once you know what matters most, your priorities will fall into place.

3. Who really makes the security decisions?

Titles can be misleading. Learn who drives decisions day to day, whether it is a program manager, an architect, or a trusted advisor. Understanding influence is often more useful than understanding the org chart.

4. What keeps our executives up at night?

Security programs thrive when leadership feels confident in them. Ask what worries your executives most and then connect your strategy directly to easing those fears.

5. Where does our data live, and who has access to it?

If you cannot answer this, you cannot secure it. Take time to understand where your data resides, how it moves, and who touches it along the way.

6. What incidents shaped this program’s history?

Every program carries the lessons of its past. Find out what went wrong before, what was learned, and what still lingers as an unspoken worry. These stories reveal more than any dashboard can.

7. How long does it take us to detect and respond to an incident?

Metrics like mean time to detect and mean time to respond are only part of the picture. Ask how those numbers are measured and what slows the process when real incidents happen.

8. How do we decide which risks to live with?

Cybersecurity is a series of trade-offs. Learn how risk is evaluated, who approves exceptions, and how those decisions are documented. You will quickly see whether your organization is proactive or reactive.

9. How do people really feel about security here?

Culture drives outcomes. Talk to engineers, analysts, and business partners. Do they see security as a helpful partner or an obstacle? Their answers will tell you where trust needs to be built.

10. Which tools do people trust, and which do they avoid?

Inherited tools often create as many problems as they solve. Ask your team which systems they rely on and which ones they quietly ignore. Their insights will guide where to invest and where to simplify.

11. Where are we flying blind?

Every program has blind spots. It might be unmanaged assets, unmonitored environments, or third parties no one tracks closely enough. Find those dark corners early and bring them into the light.

12. If the team could fix one thing, what would it be?

This simple question opens doors. It shows respect for your team’s experience and often surfaces the problems that leadership never sees. Listen carefully—this one question can change your roadmap.

13. How do we measure success beyond compliance?

Passing audits is good, but it is different from being secure. Ask what metrics truly reflect resilience, preparedness, and continuous improvement. Those are the ones that matter.

14. What happens when an incident hits?

A plan on paper is not enough. Ask how communication works during real incidents. Who gets the first call? How quickly are decisions made? The answers will show how well your plan translates to practice.

15. Who has the keys to the kingdom?

Privileged accounts, admin credentials, and token signing keys define the boundaries of trust. Know who controls them, how they are protected, and what checks exist. If no one can answer confidently, that is your first red flag.

Final Thoughts

Taking over a security program is not about proving how much you know. It is about understanding the ecosystem you are stepping into the people, the risks, the culture, and history. The best leaders do not rush to fix things. They listen first, connect dots others overlook, and build trust before acting.

Your first few weeks set the tone for everything that follows. Start with curiosity. Ask the questions no one else is asking. When you do, you will not only understand the program you have inherited, but you will also earn the confidence to lead it forward.

Inside Runpod’s GPU Cloud for Long-Running AI Agents

Runpod head of engineering Brennen Smith joins a Data Insights episode to unpack GPU-dense clouds, hidden storage bottlenecks, and a “universal orchestrator” for long-running AI agents at scale.

Glowing processor with light rays shooting out of it

Achieving HPC Performance with Xeon 6 Processors

In HPC circles, the discussion often revolves around achieving “peak FLOPS.” And while compute is unquestionably important to high performance computing, many HPC applications including scientific simulation, graph analytics, finite element modeling are actually memory-bound. Compute waits for data. That’s where Xeon 6 processor architecture, and it’s support for scalable memory, shines.

I recently met with a national lab team whose spectral simulation scaled beautifully to 128 nodes, but performance flattened beyond 128 because memory bandwidth per core collapsed. Their compute was starving, waiting for data that was slow to arrive. While it would be tempting to throw more flops at the problem, the more elegant solution to their compute constraint was delivery of a balance of compute, interconnect, and memory performance.

Luckily, we’ve designed today’s Xeon CPUs for this balanced performance. In the Xeon 6 processor family, the architecture supports both P-cores (for compute) and E-cores (for throughput) on a unified I/O and memory interface, providing computing environments found in the HPC arena the flexibility they crave. That shared fabric ensures that memory bottlenecks don’t isolate execution to narrow lanes, delivering data effectively to meet compute requirements.

How do we accomplish it? It starts with high throughput memory channels and large cache hierarchies, essentially reducing memory contention within the system. We extend this with a NUMA design, carefully tuned, ensuring that parallel tasks see minimal cross-node memory latency. We layer on low-latency coherence paths, essential to multi-socket configurations found in HPC platforms, and multi-workload support mixing compute, data staging, and I/O including checkpointing and data orchestration. This removes the opportunity for non-compute tasks to gate total workload performance.

The Path Ahead

Do Xeon 6 processors make sense for your HPC configuration? Evaluate your workload requirements, and assess if you’re gated by memory-bound application constraints. I think you’ll discover that the balanced system performance delivered by our latest generation of processors can change the game for compute delivery, making them a solid foundation for HPC deployments.

‍

Inside KubeCon 2025: How AI Is Rewriting Kubernetes Operations

Self-healing has long been Kubernetes’ north star: restart failed pods, reschedule workloads, reconcile desired state, and keep applications running through everyday failures. But AI is piling on new pressure as teams run GPU-hungry models, mix batch and real-time inference, and stretch Kubernetes across fleets of clusters and clouds. At enterprise AI scale, that pressure lands on site reliability engineering (SRE) and platform teams, who have to reason about GPU scarcity, token volume, spiky inference, and large Kubernetes fleets all at once.

Hundreds of clusters, thousands of services, and a flood of change create too many variables for humans to chase in real time. The question is no longer whether Kubernetes can restart a pod; it’s whether platforms encode SRE judgment into systems that act quickly, safely, and with an audit trail on their behalf.

At KubeCon + CloudNativeCon North America in Atlanta this week, chatter about “agentic SRE” could be heard up and down the massive showroom floor of the Georgia World Congress Center; meanwhile, the Cloud Native Computing Foundation (CNCF) unveiled its new Kubernetes “AI Conformance” push during the opening keynotes. Jonathan Bryce, executive director of cloud and infrastructure at CNCF and Chris Aniszczyk, CNCF chief technology officer, opened the sessions with a call to action to make AI workloads portable and interoperable across platforms, just as conformance once standardized Kubernetes itself for every major cloud service or private cloud option.

Against this backdrop, I spent a few days talking with exhibitors about agentic SRE and how AI is fundamentally changing Kubernetes operations.

Devtron, Komodor, and Dynatrace are each coming at that problem from a different angle. Devtron is collapsing application, infrastructure, and cost into a single view and layering in an agentic SRE interface. Komodor is turning static runbooks into a policy-scoped, multi-agent SRE that can self-heal fleets and even live-migrate pods off spot instances. Dynatrace is pushing observability from dashboards into decisions while asking whether AI is actually earning its keep.

Taken together, they sketch an ops layer that looks a lot like what AI Conformance is aiming for at the platform layer: standard patterns for how AI runs, heals, optimizes, and proves value on Kubernetes—without treating AI infrastructure stacks as fragile, one-off, bespoke environments that each need custom care and feeding.

Devtron: Merge Applications and Infrastructure, Hide Tool Chaos, Add an SRE Calculator

Devtron, an enterprise open-source Kubernetes management platform with more than 21,000 installations powering over 9 million deployments, launched Devtron 2.0 during Day 1 of the convention. The release adds an ‘Agentic SRE’ layer on top of its existing footprint to bring AI-powered autonomous operations to production Kubernetes that has to withstand catastrophic failures, ransomware, and high-availability demands at scale.

Devtron 2.0 starts with a very human problem: operators drowning in tools and organizational lines between applications and infrastructure. I chatted with CEO Ranjan Parthasarathy, who said the company wants to “simplify the lives of operators who are running Kubernetes in production.”

“Managing Kubernetes in production is challenging because, first of all, there are too many tools,” Parthasarathy said. “Second of all, there is a very clear line that separates applications and infrastructure management.”

Devtron 2.0 explicitly mimics Kubernetes’ own design.

“We have taken the approach Kubernetes took from day one, which is, they blurred the lines between app and infra in how Kubernetes is architected,” he said. “The APIs for app and infra are all the same. The way you capture app and infra in the form of manifests is all the same. So, why should manageability create an artificial separation?”

Devtron’s answer is a single environment where you can follow a problem from logs to infrastructure to cost without hopping through a half-dozen consoles, with integrated FinOps and GPU visibility so AI workloads are first-class citizens in that view.

According to Devtron, customers like BharatPe and 73 Strings are already using the platform to shrink release cycles from months to weeks and cut mean time to recovery from days to under an hour, which is the backdrop for everything Devtron is now doing with agentic SRE. Their agentic SRE layer walks the classic maturity curve: start with safe reads, then layer in human-approved changes.

“Explain is a feature that we have in our UI at select strategic places where, the minute an error happens, the user can say, ‘Explain.’ And it explains in human readable form what really happened,” Parthasarathy said.

The system also drafts remediation actions that humans review and, once tested in the wild, can bless as auto-apply for recurring conditions. The agent is more like a calculator than a replacement, Parthasarathy said.

Komodor: Autonomous AI SRE with Guardrails and Live Migration

Komodor is the autonomous AI SRE company for cloud-native infrastructure and operations. At KubeCon, the team highlighted new autonomous self-healing and cost optimization capabilities powered by Klaudia, a purpose-built agentic AI system that sits on top of Komodor’s existing Kubernetes troubleshooting platform.

Klaudia—a multi-agent system—sits closer to the hands-on-the-keyboard side of SRE. The company has run a Kubernetes troubleshooting platform for years; now they’re releasing an additional agentic AI layer on top of it, said Udi Hofesh, who works in product marketing and developer relations for the company.

“This enables the same great value autonomously, basically saving more time and providing more accurate, more expansive insights and recommendations,” he said.

The core idea is to turn Kubernetes’ reconciliation model into practical self-healing at fleet scale. Komodor’s media release about Klaudia leans hard into the scale of that problem. They cite industry data showing that 88% of technology leaders report rising stack complexity and cloud waste often exceeds 30% of total spend when misconfigurations and idle capacity linger. In one Cisco environment, the company says Klaudia helped cut ticket volume by roughly 40% and accelerated mean time to recovery by more than 80%.

“Kubernetes works and is built around reconciliation,” said Mickael Alliel, backend tech lead at Komodor.

Klaudia’s policies are designed to reconcile the workloads and applications of Komodor’s customers to always be healthy and in a working state, not just fire off static runbooks. That dynamic behavior is the key difference from traditional automation.

“The automatic runbooks or playbooks are, let’s say, something that doesn’t change,” Alliel said. “Klaudia, as the autonomous AI SRE, is able to do it a lot more dynamically… it acts as a real site reliability engineer as opposed to just a series of steps.”

With graph-wide context, Klaudia can pull telemetry from multiple namespaces and components and “get a root cause analysis up and running in as little as 15 or 30 seconds,” he said, which matters a lot when you’ve got one SRE for dozens of teams.

Guardrails are a big part of the story, especially for teams burned by LLM hallucinations.

“We actually try to enforce on Klaudia and the AI SRE as many safeguards as possible to ensure that the AI doesn’t hallucinate,” Hofesh said. “We try to ensure that, if it doesn’t know something, it will say. ‘I don’t know and I need more information’… instead of just spitting out something that is not true.”

Every action is logged, and an SRE can see both a full summary of all the actions that Klaudia has taken and the reasoning behind them.

“We gave (Klaudia) a name and a face, but it’s actually hundreds of agents that are interacting with each other,” Hofesh said. For each component in the cloud-native stack, there’s a domain expert agent, orchestrated by workflow agents that mimic SRE motions like detect, investigate, optimize.

Beyond incident response, Klaudia also pushes into cost optimization. Komodor is using it to dynamically right-size workloads, schedule pods to avoid idle resources and bin-packing dead-ends, and use their PodMotion capability to move pods and state across nodes with zero downtime so teams can chase cheaper capacity or handle infrastructure events without disrupting applications.

Dynatrace: AI Observability, ROI Pressure, and Active Decisioning

Dynatrace is an AI-powered observability and security platform that unifies application, infrastructure, log, and business data in a single data lakehouse and uses its Davis AI engine to turn that telemetry into real-time insights and automated remediation. It has had AI in its stack for more than a decade.

“We have been working in the AI space for over 12 years,” said Chief Technology Strategist Alois Reitbauer. “We were always the odd people out—the people doing stuff with AI for a very long time. Not generative AI, but AI in general, and machine learning. We use predictive AI to predict behavior and detect anomalies, then use causal AI to understand the root cause of a problem, to understand cause and effect.”

What’s shifted recently is the focus of AI observability. Early on, he said, it was about tokens and performance. Now that more systems are in production, the question has become whether it provides value.

“It’s not just, ‘How much money are we spending,’ but, ‘Do people actually get something out of it? Should we keep investing into it?’” he said.

Reitbauer pointed out that AI budgets aren’t created from thin air. He explained that, as companies move investment to AI from other areas, they have an expectation of ROI that is at least as high, if not higher, than before. He gave an example of a website that offers a product for $3, and pays $5 to generate the recommendation; not exactly a model of ROI.

On the plumbing side, he described observability’s progression from collecting data, to anomaly detection, to root-cause analysis and now to action.

“We’re moving into the next generation where tools are actually able to take action,” he said. Instead of just saying “your system is down, your servers are overloaded,” a next-gen system might say: “Your system is down, your servers are overloaded. I propose an immediate mitigation action to scale up from three to five servers… and I already created the PR, just click approve here.”

Long term, it can also surface proposals for the developer on how that code could potentially be rewritten to be more efficient.

Dynatrace’s internal agentic platform is wiring those pieces into workflows, Reitbauer said.

“Think of it as a low-code way of building an agent, almost,” he said.

The use cases line up with the themes from KubeCon: remediation workflows based on observability data, preventive workflows that reconfigure environments before trouble hits, and continuous optimization tuned to cloud environments.

TechArena Take

Kubernetes AI Conformance is about making AI workloads on Kubernetes interoperable and portable across a messy mix of models, frameworks, and hardware.

The companies I talked with are doing the same thing for operations: turning AI-heavy Kubernetes environments from bespoke into systems that can be monitored, healed, optimized, and justified at scale.

The interesting advances aren’t the fully autonomous slogans; they’re the boring-but-essential scaffolding behind them. These platforms are also shipping against real-world pain. Devtron points to customers like BharatPe and 73 Strings using its unified control plane to shrink release cycles, improve stability, and drive MTTR down from days to under an hour. Komodor cites Cisco’s platform engineering team cutting ticket loads by around 40 percent and improving MTTR by more than 80 percent as Klaudia moves from reactive triage to proactive self-healing and optimization.

Devtron’s merged application/infrastructure/cost view and human-in-the-loop agent treat autonomy like a calculator, not a replacement. Komodor’s domain-specific multi-agent approach attacks both the incident math and the spot-capacity economics. Dynatrace is pushing observability from a passive system of record toward an active participant that can propose or trigger changes—and then tie those moves back to business outcomes.

If Day 1 in Atlanta was about putting a floor under AI workloads with Kubernetes AI Conformance, these conversations were about building the mezzanine above it: how AI actually runs and proves itself in production. The self-healing promise of Kubernetes isn’t going away; it’s being reimplemented at the organizational layer—across clusters, costs, and teams—so platform and SRE leaders can keep up with AI-era workload diversity and autonomy without scaling humans linearly with every new deployment.

Abstract highway and city of a computer circuit board

MLPerf Training v5.1: GenAI Drives, Scale Ramps, Field Widens

MLCommons’ MLPerf Training v5.1 lands with three clear signals: generative AI continues to shape the benchmark mix, scaling discipline is where many teams are winning time, and the roster of credible submitters is getting broader. This round includes 65 unique systems from 20 organizations spanning silicon, systems, clouds—and, notably, an academic HPC center.

“The view I see in MLPerf is like a Formula One race—same track and rules, room for tuning, and you see who can finish fastest,” said Chuan Li, Chief Scientific Officer at Lambda, who led the company's MLPerf v5.1 efforts. It’s a clever encapsulation of MLPerf’s value proposition: standardized tasks and target quality keep the contest honest while leaving space for technique.

What’s new in v5.1 is squarely aimed at today’s workloads. Llama 3.1 8B replaces BERT for LLM pretraining—a modern, decoder-only architecture that fits on a single node (≤8 accelerators) yet mirrors software patterns used at larger scales. On the image side, Flux.1 replaces Stable Diffusion v2, reflecting the shift to transformer-based diffusion models with cleaner validation via loss. Together, these swaps align the suite with the stacks enterprises are actually deploying.

Momentum and scale show up in the submission patterns. Multi-node entries climbed sharply versus a year ago, and genAI tests drew heavy participation: Llama 3.1 8B debuted with strong interest, while Llama 2-70B LoRA continued to be a favorite fine-tuning proxy. Performance trends outpaced a simple Moore’s Law line again; gains came not just from fresh silicon but from numerics, software, and fabrics—exactly the kind of full-system work buyers need to see.

NVIDIA posted most of the fastest times and largest-scale runs this round—especially on GB200/GB300 NVL72 configurations—reflecting stack maturity and intra-rack NVLink scale. Still, the broader story is ecosystem momentum: new entrants, academic participation, and software/networking gains that turned more multi-node runs into reproducible, closed-division results.

Standout Highlights: New Entrants

• University of Florida (academic first-timer): Ran across seven benchmarks on HiPerGator DGX B200, including multi-node scaling to 448 GPUs, demonstrating closed-division reproducibility on a shared HPC environment.

• Wiwynn (platform new entrant): Kinabalu posted Llama 2-70B LoRA results at 72 and 576 GPUs on GB200 NVL72, signaling readiness of its NVLink-centric design for fine-tuning workloads.

• Datacrunch (cloud first-timer): Brought up an 8× B200 Llama 3.1-8B run via Slurm/Pyxis “Instant Clusters,” positioning for fast, reproducible re-runs rather than one-off hero numbers.

Precision and practicality deserve a note. Several submitters leaned into lower-precision training (FP8 → FP4 variants) where numerically stable, but MLPerf’s rubric keeps that grounded: time-to-target-quality forces any optimization to actually converge. The other big lever is networking and topology—RDMA over InfiniBand or tuned Ethernet, clean hierarchies, and reliability at scale—because eight nodes only help if they act like eight, not three.

Lambda: a Rack That Behaves Like One Giant GPU

Lambda was one of a short list to post on GB300 NVL72 (72 Blackwell Ultra GPUs in a single NVLink domain). Two takeaways surfaced in my side interview with Chuan Li. First, the speed-ups split roughly half-and-half between hardware (more memory, higher inter-GPU bandwidth) and software (driver/library/framework maturation).

Second, numerics helped at the margin: moving from FP8 to an FP4 variant delivered an additional double-digit percentage improvement while still meeting the accuracy target. There’s also a practical lesson here: clean, converged runs at the edge of scale require weeks of lined-up capacity and tight coordination across DC ops, fabric, and software. Useful proof point—one of 20, not the whole story.

“We saw a 1.66x speedup in our Llama 2-70B run compared to previous submissions,” Li said. “This performance improvement showcases the power of the latest NVIDIA hardware, combined with Lambda’s cloud orchestration capabilities.”

How to Read the Tables Without Getting Lost

If you’re using v5.1 to guide purchasing or platform bets, a few simple rules help:

• Compare like for like. Start within the same benchmark and similar accelerator counts. A 32-GPU result and a 512-GPU result are not interchangeable.

• Look for scale curves, not just a single number. Do you see near-linear improvements from 8 → 16 → 32 → 64 GPUs? That often tells you more than a hero time.

• Check the software notes. Frameworks, kernels, parallelism strategy, IO/storage, and data pipelines are where much of the delta lives—and MLPerf links to them.

• Use Llama 3.1 8B as a quick stack sanity test. It’s modern, single-node accessible, and a good proxy before you commit to larger spend.

• If you care about image generation, Flux.1 is the new reality. Expect different stress points than SD v2 (attention/memory/diffusion schedule) and plan tuning accordingly.

• Treat FP4 wins as conditional on convergence. The fastest path that misses target quality doesn’t count in MLPerf—and it shouldn’t in production, either.

TechArena Take

This round isn’t just about newer GPUs; it’s about maturing engineering. The two new tests (Llama 3.1 8B and Flux.1) meet the moment for enterprise Gen AI, and the influx of first-time submitters expands the set of credible places to run—from platform OEMs to an academic HPC center to nimble clouds.

As organizations continue to push the boundaries of AI infrastructure, the industry is seeing an acceleration of hardware, software, and networking innovations that are making frontier AI models more accessible and deployable at scale.

As AI infrastructure evolves, MLPerf Training provides a vital benchmark for the industry, ensuring that progress in AI development is transparent, reproducible, and measurable.

Cloud-Native 2025 by the Numbers: The Developer Tent Just Got Bigger

Cloud-native isn’t contracting—it’s climbing up the stack. The Cloud Native Computing Foundation’s (CNCF’s) latest State of Cloud Native Development—done in partnership with SlashData—shows the community expanding beyond traditional Kubernetes operators into a much wider slice of backend developers who may never touch cluster primitives directly. That shift explains why some dashboards show container/Kubernetes “usage” leveling off even as cloud-native grows overall: the interface is moving up a layer to internal developer platforms and opinionated tooling.

“Cloud-native is moving from being a tech stack to a cultural shift in how developers interact with infrastructure,” said Bob Killen, senior technical program manager at CNCF. “It’s about empowering teams to build on top of a flexible, standardized foundation, not just running workloads in containers.”

What the Data Says

CNCF and SlashData estimate 15.6 million developers now qualify as cloud native, about 32% of the global developer population, with roughly 9.3 million in the traditional backend segment. Among developers who work on backend services, 56% are cloud native in Q3 2025—up from 49% in Q1 2025. Hybrid-cloud deployments climbed from 22% in early 2021 to 30% in Q3 2025, and multi-cloud sits at 23%. Meanwhile, only 41% of professional machine learning/artificial intelligence (ML/AI) developers identify as cloud native—likely because many consume AI via managed endpoints that abstract away the stack.

Why “Cloud-Native Without Kubernetes” Makes Sense

Killen described the pattern plainly in our interview: many backend developers now deploy through internal platforms like Backstage and other dev-portal tools rather than touching containers or Kubernetes directly. That doesn’t reduce the relevance of Kubernetes—it elevates it and makes it even more accessable. Teams “build once” to Kubernetes and point workloads to wherever capacity and cost line up, on-prem or cloud, without re-plumbing their developer workflow. This is the portability dividend the ecosystem bet on a decade ago.

“While AI/ML developers have infrastructure-heavy workloads, many don’t identify as cloud-native developers because they’re interacting with the infrastructure through abstracted layers like managed endpoints,” he said.

AI is Pushing Hybrid and Multi—Just Not Always Visibly

Hybrid-cloud’s steady rise isn’t a fashion cycle; it’s economics and capacity. GPU availability, compliance posture, and data-gravity considerations favor a mixed estate: local clusters for steady-state workloads, burst capacity in public clouds when queues spike, and selective use of specialized GPU instances for inference. The report’s trendline from 22% hybrid in 2021 to 30% in 2025 tracks what we hear from platform teams: design for flexibility first, then optimize per workload.

Inside the Tech Radar: What Developers Would Adopt Today

The CNCF and SlashData Tech Radar Report, which surveys what tools developers are actually using and recommending, points to a few emerging patterns:

AI inference engines and tools: NVIDIA Triton, DeepSpeed, TensorFlow Serving, and BentoML are placed in the adopt position, with Triton leading both maturity and usefulness in developer ratings. That’s consistent with what we see in production: a bias toward stable, vendor-backed or widely used inference stacks for latency-sensitive workloads.
Agentic AI platforms: Model Context Protocol (MCP) and Llama Stack land in adopt. MCP leads on maturity and usefulness in the Radar data. For shops experimenting with agent-to-agent workflows and tool invocation, this suggests a near-term path that doesn’t require inventing a framework from scratch.
ML orchestration: Airflow and Metaflow rise to adopt. Metaflow scores highest on maturity, while Airflow tops usefulness—pragmatic choices for teams bridging data engineering legacies with modern model pipelines.

What This Means

Here are a few observations from the CNCF/SlashData State of Cloud Native Development report:

1. Design Attention Is Moving to the Portal Layer: With 77% of backend developers using at least one cloud-native technology while many don’t identify as “Kubernetes users,” the center of gravity appears to be shifting toward internal developer platforms. Cost, performance, and security signals are increasingly surfaced in portals rather than in cluster-level tools.

2. Hybrid/Multi Is Becoming a Steady State: The report shows hybrid usage at 32% and multi-cloud at 26% among backend developers, with distributed cloud at 15%. Taken together, those shares suggest multi-venue deployment is becoming routine rather than exceptional, with Kubernetes serving as the portability layer across environments.

3. AI Plumbing Is Consolidating Around a Few Stacks: Many AI teams still consume managed endpoints, but the Technology Radar highlights a narrowing set of building blocks: Triton/DeepSpeed/TF Serving/BentoML for inference, MCP/Llama Stack for agentic scaffolding, and Airflow/Metaflow for orchestration. The pattern suggests a pragmatic core is emerging inside otherwise varied AI pipelines.

Why AI Developers Under-Index on “Cloud Native”

Only 41% of professional AI/ML developers are counted as cloud native in the study. That doesn’t mean they aren’t running on cloud-native infrastructure; it means consumption is often through higher-level SaaS or managed services where the platform owns the runtime. As more teams bring inference and retrieval closer to their data for cost, latency, or privacy, expect that percentage to rise—especially as internal developer platforms (IDPs) make “cloud-native-by-default” the path of least resistance.

What to Watch Next

Two mechanics will shape 2026 roadmaps.

IDP gravity. The more you can abstract cluster ops into templates and policies inside your portal, the faster you onboard non-specialist developers into cloud-native patterns without cognitive overload. Expect continued growth in dev-portal plug-ins that expose cost and performance insights contextually rather than via separate FinOps dashboards. The report’s “big-tent” framing for backend devs is an early indicator.
Agentic patterns meeting sitereliability engineering (SRE) guardrails. As teams test agent-to-agent workflows, governance will hinge on policy scopes, action audits, and clear escalation paths. The Tech Radar’s adopt signals around MCP and Llama Stack suggest a shared vocabulary is arriving; pair that with your existing change-management controls before letting anything write to production.

TechArena Take

Cloud-native isn’t fading; it’s moving up the stack. The center of gravity appears to be shifting from cluster primitives to internal developer platforms. Kubernetes continues to function as the portability layer, while more developers interact through portals and opinionated tools rather than directly with containers.

Hybrid and multi-cloud usage looks less like an edge case and more like standard operating context. The data suggests routine use of multiple execution venues as organizations balance capacity, cost, and locality considerations over time.

Developer sentiment around inference engines (e.g., Triton, DeepSpeed, TensorFlow Serving, BentoML), agentic scaffolding (MCP, Llama Stack), and orchestration (Airflow, Metaflow) points to a pragmatic core of components coalescing inside otherwise diverse AI pipelines.

Across interviews and releases, “agentic SRE” is taking shape as a layered pattern: explain-and-observe capabilities first, human-reviewed changes next, and policy-scoped autonomy for recurring fixes. Notable strides include transparent reasoning, auditable actions, and domain-scoped agents aimed at reducing error surface.

Two advancements stand out: platform-level immutability for backups that treats ransomware recovery as table stakes, and live container migration aimed at maintaining long jobs on ephemeral capacity. Both represent meaningful steps toward reliability at fleet scale without sacrificing economics.

Wanjiku Kamau on Her Book: Out of the Loop, Into the Algorithm

Allyson Klein talks with author and Google/Intel alum Wanjiku Kamau on moving past AI skepticism, learning fast, and using new tools with intention—so readers start where they are and explore AI with hope.

‍

Cloud computing network with connected data points

KubeCon Atlanta: CNCF Sets AI Standard for Kubernetes

The energy was palpable across the Georgia World Congress Center in Atlanta this morning as 9,000 people gathered for the 10th annual KubeCon + CloudNativeCon convention, where the Linux Foundation announced a brand new Kubernetes AI Conformance program, a community-driven certification aimed at making AI workloads portable and interoperable across Kubernetes platforms.

The opening keynotes drew a packed house and delivered a clear message: the next decade of cloud native will be defined by how well this community standardizes AI at scale.

It’s a fitting inflection point. This year marks the 10-year anniversary of the Cloud Native Computing Foundation (CNCF), and the program’s journey from a handful of seed projects to a global, high-velocity ecosystem is the backdrop for what comes next.

How We Got Here

The CNCF launched in 2015 under the Linux Foundation to steward a new operational model built around containers, orchestration, and declarative automation. The first CNCF Board meeting took place that December at The New York Times offices, and by March 2016, the Technical Oversight Committee had formally accepted Kubernetes as the foundation’s first project. Ten years later, the numbers tell the story: nearly 300,000 contributors across 190 countries have pushed 18.8 million contributions into more than 230 projects. The once-compact cloud native landscape now spans everything from core orchestration to observability, service meshes, security, data, and developer experience.

That community scale shows up in the audience, too: roughly 48 percent of attendees are first-timers, a reminder that cloud native keeps onboarding new builders even as it professionalizes.

Where We Are Today

The membership base has swollen from 22 founding organizations to more than 700 member companies—platinum and gold vendors, a deep bench of silver members, and a growing cadre of end-user organizations that help steer real-world priorities. A new platinum end user, CVS Health, was announced on stage, underscoring how cloud native has moved well beyond hyperscale tech firms into heavily regulated, mission-critical industries.

AI Continues to Grow at Staggering Scale

“The two most significant trends are merging right now—cloud native and AI are not separate technology trends; they are really coming together,” said Jonathan Bryce, executive director of cloud + infrastructure at the Linux Foundation.

That was the through-line from the main stage this morning. CNCF leaders framed AI in three layers—training, inference, and applications/agents—and called out inference as the near-term hotspot. The scale is staggering: Google said its systems jumped from about 980 trillion tokens per month to roughly 1.33 quadrillion tokens per month in just a few months, and every large enterprise is now under pressure to stand up reliable, cost-efficient AI services—not just proof-of-concepts.

Major Next Step: Kubernetes AI Conformance

To meet that moment, CNCF introduced the Kubernetes AI Conformance program, a community-driven certification aimed at making AI workloads portable and interoperable across Kubernetes platforms. Platforms that earn AI Conformance are expected to meet concrete requirements across six pillars:

Accelerators: hardware abstraction and scheduling for GPUs/TPUs and other accelerators (built on capabilities like Dynamic Resource Allocation, which graduated to GA in Kubernetes 1.34).

Networking: reliable, policy-aware connectivity for AI services.
Security: supply-chain and runtime controls suitable for production use.
Scheduling: predictable placement and scaling for AI workloads, including fractional or multi-node accelerator allocation where applicable.
Observability: standardized metrics and traces for models and accelerators.
Operators: lifecycle automation patterns that make AI stacks manageable.

Why It Matters

The original Kubernetes Conformance program is one of the quiet reasons cloud native scaled: it gave buyers confidence that distributions wouldn’t drift and that workloads would behave predictably across environments. AI needs the same discipline. Without it, teams get trapped in bespoke integrations, vendor-specific quirks, and fragile pipelines that are hard to operate at scale.

A live demo on stage walked through what an AI-conformant cluster looks like in practice: using DRA to discover accelerators and define resource plans; deploying a vision-language model; scraping model metrics; autoscaling via custom metrics; and exposing accelerator telemetry such as utilization and temperature. The point wasn’t the specific model—it was the proof that a consistent, open set of platform guarantees shortens the path from “it runs” to “it operates.”

Who’s In the Newly Formed AI-Kubernetes Club

Initial participants shown on the keynote logo wall include hyperscalers, enterprise platforms, and AI infrastructure providers such as Google Cloud, Microsoft Azure, AWS, NVIDIA, Red Hat, Oracle, SUSE, SAP, Akamai, Alibaba Cloud, Broadcom, CoreWeave, DaoCloud, and Kubermatic, among others. Expect that roster to grow quickly as vendors align their roadmaps and customers start asking for the badge.

How It Accelerates Adoption

By defining a minimum common denominator for accelerators, security, scheduling, observability, and operators, AI Conformance gives builders a stable target and gives organizations a portable operating model. Vendors can innovate above the line; users get fewer surprises when they move from lab to production or from one environment to another. It’s exactly the kind of boring, essential plumbing that lets the more exciting parts of AI—faster models, better retrieval, smarter agents—ship without reinventing the platform every time.

The Bigger Frame: Cloud Native and AI Are Merging

CNCF’s latest developer data puts the cloud-native population at 15.6 million, with nearly half already building AI systems. That overlap explains the energy in Atlanta: the community that figured out how to run the internet reliably now wants to make AI equally routine. The early signal is that Kubernetes will be the common substrate for AI not only because it’s ubiquitous, but because conformance programs like this one make it predictable.

TechArena Take

Standards are how ecosystems scale. Kubernetes AI Conformance is CNCF replaying a proven playbook at precisely the right layer of the stack. It won’t pick winners for model servers, vector databases, or agent frameworks—and it shouldn’t. Instead, it sets a floor for what every platform must guarantee so AI teams can move faster without stapling together one-off integrations for each environment.

Three implications to watch:

Procurement and platform strategy: expect AI Conformance to show up in RFPs. If you sell or operate Kubernetes at scale, your roadmap just got a new column.
Inference operations maturity: the demo emphasis on DRA, custom metrics, autoscaling, and accelerator telemetry is a tell—production AI is an operational problem first.
Portability pressure: as more providers certify, the switching cost narrative weakens. That’s healthy competitive pressure and a boon for enterprises trying to avoid lock-in.

Keep following TechArena.ai this week for updates and news from KubeCon + CloudNativeCon in Atlanta.

Cooling the Future: Where Does Cold Plate End and Immersion Begin?

I recently visited a customer whose AI racks were reaching 750–800W per slot. Their data center layout couldn’t push more airflow so they were in hot pursuit, no pun intended, of cold plate cooling alternatives. But as they forecasted system power forward, they saw a near term horizon where cold plate technology may not provide enough thermal mitigation to address their dense infrastructure demands. They faced the question of migration to cold plate now knowing that another technology migration may be required in the future, or take the plunge into immersion cooling now.

This customer is not alone. We have reached the choke point of air cooling within highly dense data center infrastructure, and more deployments are reaching for liquid cooling solutions. Today, that liquid cooling alternative is likely a cold plate solution, delivering the right mix of cooling efficiency and required thermal control. And while this transition is playing out in data centers today, many are asking how long cold plate solutions will keep pace with data center requirements. After all, today’s racks are climbing past 1 megawatt with data center facilities scaling past 1 gigawatt representing unprecedented heat to mitigate. This leads to the question of how long cold plate’s day in the sun will last before immersion cooling becomes a required alternative.

But what is cold plate? Cold plate solutions offer controlled liquid to chip and handle thermal densities significantly better than air cooling alternatives. Many HPC and AI boxes today already support cold plate solutions. It’s relatively mature, perceived to be controllable, and less disruptive to the data center to retrofit into brownfield environments. Up to a point, it works well!

At some point, though, cool plate solutions have reached an existential challenge of heat dissipation. Customers can experience leaks or thermal escapes with highly variable AI performance as system density scales. For these customers, immersion cooling (full immersion in dielectric fluid) offers an alternative. Immersion handles higher power densities with lower energy overhead, but it requires much more system certification to deploy safely.

At Intel, we see the coming of the immersion era, at least for high performance compute clusters. That’s why we’re helping to future proof infrastructure investment by certifying Xeon platforms for immersion, ensuring that CPUs behave reliably in immersive environments. This enables higher rack densities with confidence in stability and availability.

Moving into the Chill

So how should you approach the liquid alternatives? This has a lot to do with the density targets you’ve got on your infrastructure roadmap, and at what point you’ll hit a requirement for immersion. The time is now to evaluate cold plate solutions for immediate requirements and begin talking to vendors about immersion support. If you’re considering greenfield buildout, a transition to immersion sooner for your densest racks may make sense. In brownfield environments, take advantage of cold plate alternatives and their easier integration for the time being. Most importantly, strategically plan infrastructure within a long-term horizon to prioritize an efficient path through liquid cooling adoption with the right compute infrastructure support at each point in the migration path.

‍

Edge

Robert Bielby

Chameleon Semiconductor

Nov 10, 2025

Article

Under the Hood: The Alphabet Soup of Car Connectivity

The automotive industry’s introduction of the Controller Area Network (CAN) protocol in 1986 marked a significant departure from point-to-point wiring for electrical connections, which until then had been the mainstay of the industry. The shift to a relatively lightweight bus-based architecture was a nod to reality: electrical content in the vehicle was scaling fast, and alternatives to one-off wiring were needed.

The first vehicle to use the CAN bus was a Mercedes-Benz S-Class in 1991. CAN connected five Electronic Control Units (ECUs) for engine, body, and climate control. That moment marked the starting point for the evolution—if not revolution—of connectivity standards in the automobile, and it set the stage for architectural disruption.

Today’s Software-Defined Vehicle (SDV) is embracing a zonal architecture, a connectivity scheme based primarily on physical location rather than the specific capability of any given actuator or sensor. This approach typically uses about 300 meters of wiring, a reduction of roughly 4,700 meters compared with earlier distributed designs—a substantial savings in both weight and cost. The zonal model leans on a myriad of connectivity technologies that deliver more robustness, reliability, and deterministic timing than prior schemes. Those improvements are essential as electronic subsystems take over greater—sometimes complete—control of the vehicle.

A useful analogy: connectivity in a car is the nervous system. It must be responsive and provide failover. The nervous system not only links the senses but also controls muscles in response to the brain. Vehicles aren’t any different.

Since CAN, many other connectivity types have been introduced to address different functions in the vehicle. Some, like Ethernet, were adapted from mainstream computing and retrofitted to automotive. Those adaptations generally address real-time, deterministic responsiveness and fault detection. Interestingly, with the possible exception of CAN, standards defined specifically for automotive haven’t found broad adoption elsewhere.

A brief aside on EMI: a wire is, electrically, an antenna. That simple model explains a lot. Both radiated energy limits and required immunity are governed by industry standards. If not managed properly, high-speed signals across long wires can create (and suffer from) EMI. Unfortunately, the need for high-performance communications is at odds with minimizing emissions. In automotive, nothing about this is easy—we just tend to take it for granted.

What follows is a quick survey of the alphabet soup now in play.

Body domain / general-purpose connectivity

These standards are either still in use or have been replaced by newer alternatives:

CAN (Controller Area Network): used for powertrain, airbags, ABS, and battery management. It has survived because it’s simple and low cost. Historically limited to about 1 Mbit/s.
CAN-FD (Flexible Data-Rate): a faster derivative of CAN (up to ~8 Mbit/s) that eases CAN’s rate limitations while staying cheaper than Ethernet.
LIN (Local Interconnect Network): very low-rate communications (~19 kbit/s) used in body control and anywhere cost is more important than performance.
FlexRay: 10 Mbit/s with deterministic timing and fault protection; optimized for drive-by-wire steering. Over time, it has largely been replaced by automotive-optimized Ethernet.

Two primary technologies have dominated here:

MOST (Media-Oriented Systems Transport): largely defunct now. It targeted premium in-vehicle entertainment using optical links—great EMI immunity, but high cost and connector reliability issues. Max rate around 150 Mbit/s proved too slow for today’s 4K dash cams.
LVDS / FPD-Link (Flat Panel Display Link): low-voltage differential signaling with framing for displays. Low swing and differential pairs reduce emissions, and differential reception boosts noise immunity. Still a workhorse for displays and cameras.

A Special Mention: A2B

A2B (Automotive Audio Bus), introduced by Analog Devices in 2014, uses low-cost unshielded twisted pair to carry audio from a head unit (master) to slave devices like speakers and microphones. It supports multiple channels of high-resolution digital audio and microphone arrays for hands-free calling and adaptive noise cancellation.

High-Resolution Sensors and Displays (SerDes family)

For lidar, cameras, high-resolution surround view, and driver information displays, SerDes links embed clocking within the data stream to achieve high rates with low latency:

GMSL (Gigabit Multimedia Serial Link), FPD-Link, and ASA Motion Link (Automotive SerDes Alliance) support data rates into the multi-gigabit range (commonly up to the mid-teens of Gbit/s) with relatively low latency.
MIPI (Mobile Industry Peripheral Interface): originally for smartphones (camera and display), now widely adopted in automotive, with extensions for functional safety and security essential to mission-critical use.

Automotive Ethernet (and Switching)

There are multiple (about seven) automotive Ethernet derivatives tuned for a wide spectrum of in-vehicle needs, from 10 Mbit/s to 25 Gbit/s, with different reaches and price points. They address everything from “CAN-plus” body functions to ECU-to-ECU backbones. All use differential signaling; most ride over unshielded single twisted pair to minimize cost and weight. Alongside PHYs, the associated switching fabric has also been adapted for automotive.

Time-Sensitive Networking (TSN)

Standard Ethernet is best-effort. TSN Ethernet adds determinism: end-to-end camera-to-actuator latency under 5 milliseconds with less than 50 microseconds of jitter is achievable. That performance and the ability to prioritize time-critical traffic make Ethernet viable for emergency braking. TSN is a family of specifications; several variants address time sync, scheduling, stream reservation, and reliability.

Wireless in and Around the Vehicle

Bluetooth: audio and phone-based keys.
Wi-Fi: over-the-air updates and rear-seat mirroring.
C-V2X: vehicle-to-vehicle and vehicle-to-infrastructure communications. While multiple variants have been floated over the years, C-V2X is emerging as the superset.

Key Takeaways

Point-to-point wiring is mostly a thing of the past.

Weight and cost pressures drove alternatives; zonal architectures dramatically trim both.
CAN signaled the first big inflection; many link types now coexist, each optimized for its job.

In automotive, cost always matters; lower-cost solutions win when their features suffice.
Connectivity is the nervous system—reliability, robustness, latency, and determinism matter more as electronics take control.
Proven standards keep adapting: TSN Ethernet is becoming a mainstay, and MIPI’s smartphone heritage now carries automotive-grade safety and security.

Expansive city of high rise buildings with abstract digital security wall

How Financial Services Can Responsibly Scale AI

As AI adoption accelerates across industries, financial services takes the front line of both innovation and risk. From fraud detection to customer personalization, AI is reshaping how institutions operate. But the sector’s high stakes and regulatory complexity demand a uniquely careful approach.

At the recent AI Infra Summit in Santa Clara, Jeniece Wnorowski and I sat down with FinTech expert Anusha Nerella for a Data Insights conversation about how financial organizations can responsibly scale AI, stay ahead of fraudsters, and build teams equipped for the future.

“Many institutions are still in the early stages of AI deployment, while bad actors are moving fast and experimenting aggressively,” Nerella said.

This dynamic creates an urgent need for stronger, more agile defenses. Nerella emphasized that financial firms must accelerate their AI implementation cycles without sacrificing the governance and compliance guardrails that define the industry.

Regulatory and Team Culture Shifts

Asked what the broader technology ecosystem should do to support responsible AI in finance and enterprise, Nerella returned to the importance of regulatory alignment.

“Everything has to go through the regulatory and compliance [process] in order to make it responsibly…applicable to the enterprise sector,” she said.

But regulations alone aren’t enough. Nerella believes that financial institutions must rethink team structures and knowledge transfer to keep pace. She advocates for what she calls “reverse training,” in which organizations bring in engineers well-versed in AI frameworks and libraries, then combine their expertise with the strategic experience of senior leaders.

By fostering two-way collaboration between new AI talent and experienced financial professionals, companies can build stronger, future-ready teams.

“It becomes… a collaborative effort for sure,” Nerella explained. “It’s an equal opportunity here because whoever [has] decades of experience…might have limited exposure towards AI-based frameworks or library utilization or hands-on experience.”

This equal exchange of knowledge, she argued, is essential for success.

Start Small, Stay Governed

For organizations just beginning their AI journeys, Nerella’s advice is both practical and pointed: don’t try to boil the ocean. She recommends starting with “two or three clear use cases with ROI” and ensuring that governance and control mechanisms are in place from the outset.

“When you follow all these basic principles, then you will be able to see…result-oriented AI-based implementation from your end,” she said.

Throughout the conversation, she underscored that AI success in financial services requires human-in-the-loop collaboration.

TechArena Take

The financial sector’s high regulatory stakes, complex legacy systems, and relentless fraud threats make its AI journey distinct. Nerella’s insights highlight that the path forward isn’t just about technology—it’s about culture, compliance, and collaboration.

To build responsible and trusted AI systems, financial organizations must:

Accelerate adoption cycles without compromising governance.
Foster reverse training models that marry AI expertise with institutional knowledge.
Start with targeted, ROI-driven use cases, rather than sprawling transformations.

As the industry races to stay ahead of increasingly sophisticated fraud tactics, success will depend on balancing agility and accountability.

Watch the full podcast | Subscribe to our newsletter

VAST and CoreWeave Sign $1.17B Deal to Power AI Data Layer

VAST Data announced a $1.17 billion commercial agreement with CoreWeave that makes the VAST AI OS the primary data foundation for CoreWeave’s AI cloud, extending a collaboration that began when CoreWeave selected VAST to power its GPU cloud storage layer in 2023.

AI clouds are maturing from GPU-first builds to balanced, data-aware platforms that can keep training pipelines fed while serving real-time inference at scale. In that context, the data layer isn’t a bolt-on—it’s table stakes. VAST and CoreWeave are formalizing that reality in dollar terms and roadmap alignment.

The companies describe a multi-year commercial agreement that cements VAST as CoreWeave’s primary data platform. The release emphasizes instant access to massive datasets, reliability at cloud scale, and performance across both training and inference. It also highlights a “new class of intelligent data architecture” aimed at continuous training and real-time processing.

While detailed term length wasn’t disclosed, outside reporting characterizes the pact as multi-year and situates it within the broader generative-AI infrastructure build-out, noting VAST’s momentum and revenue trajectory this year.

How the Stack Comes Together

CoreWeave is known for GPU-accelerated infrastructure tailored for AI/ML, rendering/VFX, and other compute-intensive workloads. VAST’s AI OS consolidates data and compute services, with the company positioning its DASE architecture as a parallel distributed system designed to remove trade-offs between performance, scale, and resilience. In practical terms, the pitch is a single, scalable substrate that can be deployed across any CoreWeave data center to support both throughput-heavy training and latency-sensitive inference paths.

Two strategic threads stand out:

Data locality and pipeline efficiency: Training clusters need sustained bandwidth and predictable data access; inference fleets need fast retrieval against ever-fresh models and features. Making VAST the primary data foundation is meant to reduce friction between those modes so customers can iterate faster with fewer data-movement penalties.
Product-roadmap lock-step: The partners say they’ll co-deliver “sophisticated data services across the full stack,” which reads as deeper integration of storage, metadata, caching, and possibly agentic or workflow services directly in the platform—less glue code for customers, more out-of-the-box performance.

Why It’s Credible

This isn’t a green-field pairing. CoreWeave first named VAST as the data platform for its NVIDIA-powered AI cloud back in 2023, and since then both companies have scaled rapidly alongside enterprise AI adoption. Today’s announcement formalizes that relationship with a sizable commercial framework and sets expectations around platform primacy.

Market Context

AI infrastructure buyers are hunting for time-to-value: they want capacity that stands up quickly, sustains training throughput, and serves inference without spiraling costs. That’s pushing clouds—especially specialized “neo-clouds” like CoreWeave—to harden their data planes with predictable performance and global operability. The $1.17B figure signals that in the AI era, the data layer is where performance, reliability, and unit economics converge. External coverage also notes VAST’s broader customer footprint and fundraising signals, reinforcing the company’s position as more than a storage vendor—it’s pitching a full AI operating substrate.

What Customers Should Watch

Performance under mixed workloads: Can the same data substrate deliver both the streaming throughput that training demands and the low-latency retrieval inference needs without costly duplication?
Operational simplicity: If VAST’s AI OS truly consolidates services, expect fewer moving parts for customers and faster onboarding of new regions or clusters inside CoreWeave’s footprint.
Economics at scale: The promise here includes “cloud-scale economics.” Keep an eye on storage-to-compute ratios, egress patterns, and the amount of data staging you can eliminate as pipelines evolve.

What’s Next

VAST says the partnership will continue to evolve with shared product development. An analyst community briefing with CEO Renen Hallak on Thursday, November 13, will unpack strategy implications and additional updates.

TechArena Take

The signal here isn’t just the dollar figure—it’s the architectural vote: CoreWeave is betting that a unified, software-defined data plane is indispensable to AI cloud differentiation. For VAST, “AI OS” stops being slideware and becomes contractually central to one of the most prominent AI clouds. The near-term win is customer experience—simpler pipelines, faster iteration, fewer knobs to turn. The longer-term implication is competitive pressure: hyperscalers and other neo-clouds will need similarly opinionated data stacks that erase the gaps between training and inference. If VAST and CoreWeave can translate this alignment into measurable SLA gains and lower delivered cost per token/frame/query, this deal will read as a blueprint for how AI clouds professionalize the data layer at scale.

Welcome tothe Forum

Infleqtion and the Rise of Hybrid CPU GPU Quantum Systems

Expanso's David Aronchick on Data Gravity and Pipeline Debt

TechArena Wins 2 Communicator Awards for Editorial Excellence

Your Password is Being Cracked By Something That Never Sleeps

AI’s Power Crunch Is Making Brownfield the New Greenfield

Taking AI to the Diagnostic Lab with Care

qBraid on Quantum Computing: From Hype to Developer Reality

Submer's Gabriel Lazar on Heat Reuse and AI Sustainability

Laura St. John on Financial Discipline in the AI Era

Data Insights: Krishna Subramanian on Data & AI Costs

ZeroPoint Technologies Identifies the Real AI Infra Bottleneck

Dell is Building the Classical Foundation for the Quantum Era

AI’s Power Crunch Is Making Brownfield the New Greenfield

Taking AI to the Diagnostic Lab with Care

qBraid on Quantum Computing: From Hype to Developer Reality

Submer's Gabriel Lazar on Heat Reuse and AI Sustainability

Scaleway's Albane Bruyas on Europe's Sovereign AI Cloud Era

OCP EMEA Summit 2026: The AI Data Center Gets Its Blueprint

OCP Advances the Open Data Center Ecosystem Vision at EMEA Summit

Data Insights: Enterprise AI with Adity Dokania

Dana Bos: Culture Isn’t a Perk. In AI, It’s Infrastructure

Why Ultra Ethernet Matters for the Software-Defined Vehicle

Laura St. John on Financial Discipline in the AI Era

Data Insights: Krishna Subramanian on Data & AI Costs

ZeroPoint Technologies Identifies the Real AI Infra Bottleneck

From Operators to Innovators: Inside Midas Immersion Cooling

From Necessity to Innovation

The Physics of Immersion

The Thermal Recovery Advantage

Practical Deployment Considerations

The Midas Difference

TechArena Take

AI, the Silent Partner Mitigating Clinician Burnout

The Burnout Epidemic

AI as a Workflow Optimizer

Clinical Decision Support: Reducing Cognitive Load

Real-World Examples

The Human-AI Partnership

Future Outlook: AI as a Resilience Tool

Valvoline Global Operations Engineers Data Center Liquid Cooling

Two Liquid Cooling Solutions to Meet the Density Challenge

Precision Testing and Partnership: Engineering Compatibility at Every Level

Sustainability Through Operational Efficiency

The TechArena Take

Equinix on Architecting the AI-Ready Data Center

6 Plays to Close the AI-Era Connectivity Gap

Play 1: Redefine the Problem

Play 2: Admit Fiber Is Essential, but Not Sufficient

Play 3: Use Each Medium Where It Wins

Play 4: Design Hybrid

Play 5: Tame the Operational Offset

Play 6: Build for Dissimilar Redundancy and AI-Era Resilience

The Leadership Challenge

OCP 2025: Arm’s Chiplet Play Aims To Democratize AI Compute

TechArena Take

Physical AI in Production: Datara AI’s Data-Loop Edge Playbook

Why the Market is Ready

The Architecture That’s Working

The Workforce Reality

TechArena Take

Inside the New Rules of Responsible AI Governance

Defining Responsible AI Governance

Principles of Responsible AI

The Policy Landscape: U.S. and Europe

Why Policymakers Care

The Path Forward

Efficiency: The New Moat for Data and AI Teams

Efficiency Is the Real Moat in AI and Data Engineering

Architecture Matters More Than Algorithms

Sustainability Metrics Will Shape the Next Wave of Innovation

Constraints Drive Creativity

Efficiency Is Becoming the Language of Leadership

Conclusion

Resilience in the Age of AI: Inside Commvault’s ResOps Pitch

From Cyber Resilience to AI Resilience

ResOps: Category Creation or Real Operational Shift?

Identity Becomes the New Perimeter

Clean Recovery as a First-Class Outcome

Cloud, On-Prem, and the Unity Story

Welcome to
the Forum

Physical AI in Production: Datara AI’s Data-Loop Edge Playbook

Why the Market is Ready 

The Architecture That’s Working

The Workforce Reality 

TechArena Take