
Inside KubeCon 2025: How AI Is Rewriting Kubernetes Operations
Self-healing has long been Kubernetes’ north star: restart failed pods, reschedule workloads, reconcile desired state, and keep applications running through everyday failures. But AI is piling on new pressure as teams run GPU-hungry models, mix batch and real-time inference, and stretch Kubernetes across fleets of clusters and clouds. At enterprise AI scale, that pressure lands on site reliability engineering (SRE) and platform teams, who have to reason about GPU scarcity, token volume, spiky inference, and large Kubernetes fleets all at once.
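That desired-state loop is simple to picture. Here's a toy Python sketch of Kubernetes-style reconciliation — an illustration of the pattern only, not real controller code:

```python
# Toy model of Kubernetes-style reconciliation: compare desired vs.
# observed state and emit the actions needed to converge them.
def reconcile(desired: dict, observed: dict) -> list[str]:
    """Return the actions that move `observed` toward `desired`.

    Both dicts map workload name -> replica count; fewer observed
    replicas than desired means something failed and must be restarted.
    """
    actions = []
    for name, want in desired.items():
        have = observed.get(name, 0)
        if have < want:
            actions.append(f"scale-up {name}: {have} -> {want}")
        elif have > want:
            actions.append(f"scale-down {name}: {have} -> {want}")
    return actions

# A crashed pod (1 of 3 replicas left) is "healed" on the next loop tick.
print(reconcile({"web": 3, "worker": 2}, {"web": 1, "worker": 2}))
# -> ['scale-up web: 1 -> 3']
```

The hard part at enterprise AI scale isn't this loop; it's deciding what "desired" should be when GPUs are scarce and workloads are spiky, which is where the agentic tooling below comes in.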
Hundreds of clusters, thousands of services, and a flood of change create too many variables for humans to chase in real time. The question is no longer whether Kubernetes can restart a pod; it’s whether platforms encode SRE judgment into systems that act quickly, safely, and with an audit trail on their behalf.
At KubeCon + CloudNativeCon North America in Atlanta this week, chatter about “agentic SRE” could be heard up and down the massive showroom floor of the Georgia World Congress Center; meanwhile, the Cloud Native Computing Foundation (CNCF) unveiled its new Kubernetes “AI Conformance” push during the opening keynotes. Jonathan Bryce, executive director of cloud and infrastructure at CNCF, and Chris Aniszczyk, CNCF chief technology officer, opened the sessions with a call to action to make AI workloads portable and interoperable across platforms, just as conformance once standardized Kubernetes itself for every major cloud service and private cloud option.
Against this backdrop, I spent a few days talking with exhibitors about agentic SRE and how AI is fundamentally changing Kubernetes operations.
Devtron, Komodor, and Dynatrace are each coming at that problem from a different angle. Devtron is collapsing application, infrastructure, and cost into a single view and layering in an agentic SRE interface. Komodor is turning static runbooks into a policy-scoped, multi-agent SRE that can self-heal fleets and even live-migrate pods off spot instances. Dynatrace is pushing observability from dashboards into decisions while asking whether AI is actually earning its keep.
Taken together, they sketch an ops layer that looks a lot like what AI Conformance is aiming for at the platform layer: standard patterns for how AI runs, heals, optimizes, and proves value on Kubernetes—without treating AI infrastructure stacks as fragile, one-off, bespoke environments that each need custom care and feeding.
Devtron: Merge Applications and Infrastructure, Hide Tool Chaos, Add an SRE Calculator
Devtron, an enterprise open-source Kubernetes management platform with more than 21,000 installations powering over 9 million deployments, launched Devtron 2.0 during Day 1 of the convention. The release adds an “Agentic SRE” layer on top of its existing footprint to bring AI-powered autonomous operations to production Kubernetes that has to withstand catastrophic failures, ransomware, and high-availability demands at scale.
Devtron 2.0 starts with a very human problem: operators drowning in tools and organizational lines between applications and infrastructure. I chatted with CEO Ranjan Parthasarathy, who said the company wants to “simplify the lives of operators who are running Kubernetes in production.”
“Managing Kubernetes in production is challenging because, first of all, there are too many tools,” Parthasarathy said. “Second of all, there is a very clear line that separates applications and infrastructure management.”
Devtron 2.0 explicitly mimics Kubernetes’ own design.
“We have taken the approach Kubernetes took from day one, which is, they blurred the lines between app and infra in how Kubernetes is architected,” he said. “The APIs for app and infra are all the same. The way you capture app and infra in the form of manifests is all the same. So, why should manageability create an artificial separation?”
Devtron’s answer is a single environment where you can follow a problem from logs to infrastructure to cost without hopping through a half-dozen consoles, with integrated FinOps and GPU visibility so AI workloads are first-class citizens in that view.
According to Devtron, customers like BharatPe and 73 Strings are already using the platform to shrink release cycles from months to weeks and cut mean time to recovery from days to under an hour. That track record is the backdrop for everything Devtron is now doing with agentic SRE, a layer that walks the classic maturity curve: start with safe reads, then layer in human-approved changes.
“Explain is a feature that we have in our UI at select strategic places where, the minute an error happens, the user can say, ‘Explain.’ And it explains in human readable form what really happened,” Parthasarathy said.
The system also drafts remediation actions that humans review and, once tested in the wild, can bless as auto-apply for recurring conditions. The agent is more like a calculator than a replacement, Parthasarathy said.
Komodor: Autonomous AI SRE with Guardrails and Live Migration
Komodor is the autonomous AI SRE company for cloud-native infrastructure and operations. At KubeCon, the team highlighted new autonomous self-healing and cost optimization capabilities powered by Klaudia, a purpose-built agentic AI system that sits on top of Komodor’s existing Kubernetes troubleshooting platform.
Klaudia—a multi-agent system—sits closer to the hands-on-the-keyboard side of SRE. The company has run a Kubernetes troubleshooting platform for years; now they’re releasing an additional agentic AI layer on top of it, said Udi Hofesh, who works in product marketing and developer relations for the company.
“This enables the same great value autonomously, basically saving more time and providing more accurate, more expansive insights and recommendations,” he said.
The core idea is to turn Kubernetes’ reconciliation model into practical self-healing at fleet scale. Komodor’s media release about Klaudia leans hard into the scale of that problem, citing industry data showing that 88% of technology leaders report rising stack complexity, and that cloud waste often exceeds 30% of total spend when misconfigurations and idle capacity linger. In one Cisco environment, the company says Klaudia helped cut ticket volume by roughly 40% and improved mean time to recovery by more than 80%.
“Kubernetes works and is built around reconciliation,” said Mickael Alliel, backend tech lead at Komodor.
Klaudia’s policies are designed to continuously reconcile customer workloads and applications toward a healthy, working state, not just fire off static runbooks. That dynamic behavior is the key difference from traditional automation.
“The automatic runbooks or playbooks are, let’s say, something that doesn’t change,” Alliel said. “Klaudia, as the autonomous AI SRE, is able to do it a lot more dynamically… it acts as a real site reliability engineer as opposed to just a series of steps.”
With graph-wide context, Klaudia can pull telemetry from multiple namespaces and components and “get a root cause analysis up and running in as little as 15 or 30 seconds,” he said, which matters a lot when you’ve got one SRE for dozens of teams.
Guardrails are a big part of the story, especially for teams burned by LLM hallucinations.
“We actually try to enforce on Klaudia and the AI SRE as many safeguards as possible to ensure that the AI doesn’t hallucinate,” Hofesh said. “We try to ensure that, if it doesn’t know something, it will say, ‘I don’t know and I need more information’… instead of just spitting out something that is not true.”
Every action is logged, and an SRE can see both a full summary of all the actions that Klaudia has taken and the reasoning behind them.
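The pattern Hofesh describes, where every proposed action passes a policy gate and leaves an audit trail, can be sketched in a few lines of Python. The names here are hypothetical; this is the shape of the guardrail, not Komodor's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class GuardedRemediator:
    """Toy guardrail: only pre-approved action types run automatically;
    everything else is queued for a human, and every decision is logged."""
    allowed_actions: set[str]
    audit_log: list[str] = field(default_factory=list)

    def propose(self, action: str, reason: str) -> str:
        kind = action.split()[0]  # e.g. "restart" from "restart pod/web-1"
        if kind in self.allowed_actions:
            self.audit_log.append(f"AUTO {action} ({reason})")
            return "executed"
        self.audit_log.append(f"PENDING {action} ({reason})")
        return "awaiting-approval"

bot = GuardedRemediator(allowed_actions={"restart"})
print(bot.propose("restart pod/web-1", "CrashLoopBackOff"))   # executed
print(bot.propose("delete namespace/prod", "disk pressure"))  # awaiting-approval
```

The audit log is the point: whether an action ran automatically or waited for approval, an SRE can replay exactly what the system decided and why.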
“We gave [Klaudia] a name and a face, but it’s actually hundreds of agents that are interacting with each other,” Hofesh said. For each component in the cloud-native stack, there’s a domain expert agent, orchestrated by workflow agents that mimic SRE motions: detect, investigate, optimize.
Beyond incident response, Klaudia also pushes into cost optimization. Komodor is using it to dynamically right-size workloads, schedule pods to avoid idle resources and bin-packing dead-ends, and use their PodMotion capability to move pods and state across nodes with zero downtime so teams can chase cheaper capacity or handle infrastructure events without disrupting applications.
Dynatrace: AI Observability, ROI Pressure, and Active Decisioning
Dynatrace is an AI-powered observability and security platform that unifies application, infrastructure, log, and business data in a single data lakehouse and uses its Davis AI engine to turn that telemetry into real-time insights and automated remediation. It has had AI in its stack for more than a decade.
“We have been working in the AI space for over 12 years,” said Chief Technology Strategist Alois Reitbauer. “We were always the odd people out—the people doing stuff with AI for a very long time. Not generative AI, but AI in general, and machine learning. We use predictive AI to predict behavior and detect anomalies, then use causal AI to understand the root cause of a problem, to understand cause and effect.”
What’s shifted recently is the focus of AI observability. Early on, he said, it was about tokens and performance. Now that more systems are in production, the question has become whether the AI actually provides value.
“It’s not just, ‘How much money are we spending,’ but, ‘Do people actually get something out of it? Should we keep investing into it?’” he said.
Reitbauer pointed out that AI budgets aren’t created from thin air. As companies shift investment to AI from other areas, he explained, they expect ROI at least as high as before, if not higher. He gave an example of a website that sells a product for $3 but pays $5 to generate the recommendation: not exactly a model of ROI.
On the plumbing side, he described observability’s progression from collecting data, to anomaly detection, to root-cause analysis and now to action.
“We’re moving into the next generation where tools are actually able to take action,” he said. Instead of just saying “your system is down, your servers are overloaded,” a next-gen system might say: “Your system is down, your servers are overloaded. I propose an immediate mitigation action to scale up from three to five servers… and I already created the PR, just click approve here.”
Long term, it can also surface proposals for the developer on how that code could potentially be rewritten to be more efficient.
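Reitbauer's three-to-five-servers example boils down to a small capacity calculation wrapped in a human-approval gate. A toy sketch of that shape (illustrative only, not Dynatrace's API):

```python
import math

def propose_mitigation(load_per_server: float, servers: int,
                       target_load: float = 0.7) -> dict:
    """Given observed per-server load, propose a replica count that
    brings utilization under `target_load`, as an approvable change."""
    total = load_per_server * servers
    needed = math.ceil(total / target_load)
    return {
        "finding": f"servers overloaded at {load_per_server:.0%} each",
        "proposal": f"scale from {servers} to {needed} servers",
        "requires_approval": True,  # a human clicks approve before it applies
    }

# 3 servers at ~110% load each -> propose 5 servers, pending approval.
print(propose_mitigation(1.1, 3))
```

The difference from a plain autoscaler is the last field: the system drafts the change (here, a dict; in Reitbauer's telling, a ready-made PR) but leaves the final click to a human.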
Dynatrace’s internal agentic platform is wiring those pieces into workflows, Reitbauer said.
“Think of it as a low-code way of building an agent, almost,” he said.
The use cases line up with the themes from KubeCon: remediation workflows based on observability data, preventive workflows that reconfigure environments before trouble hits, and continuous optimization tuned to cloud environments.
TechArena Take
Kubernetes AI Conformance is about making AI workloads on Kubernetes interoperable and portable across a messy mix of models, frameworks, and hardware.
The companies I talked with are doing the same thing for operations: turning AI-heavy Kubernetes environments from bespoke into systems that can be monitored, healed, optimized, and justified at scale.
The interesting advances aren’t the fully autonomous slogans; they’re the boring-but-essential scaffolding behind them. These platforms are also shipping against real-world pain. Devtron points to customers like BharatPe and 73 Strings using its unified control plane to shrink release cycles, improve stability, and drive MTTR down from days to under an hour. Komodor cites Cisco’s platform engineering team cutting ticket loads by around 40% and improving MTTR by more than 80% as Klaudia moves from reactive triage to proactive self-healing and optimization.
Devtron’s merged application/infrastructure/cost view and human-in-the-loop agent treat autonomy like a calculator, not a replacement. Komodor’s domain-specific multi-agent approach attacks both the incident math and the spot-capacity economics. Dynatrace is pushing observability from a passive system of record toward an active participant that can propose or trigger changes—and then tie those moves back to business outcomes.
If Day 1 in Atlanta was about putting a floor under AI workloads with Kubernetes AI Conformance, these conversations were about building the mezzanine above it: how AI actually runs and proves itself in production. The self-healing promise of Kubernetes isn’t going away; it’s being reimplemented at the organizational layer—across clusters, costs, and teams—so platform and SRE leaders can keep up with AI-era workload diversity and autonomy without scaling humans linearly with every new deployment.