
There’s a conversation happening at the edges of the AI infrastructure world that hasn’t quite broken through to the mainstream yet. It’s not about which GPU cluster wins the benchmark race or which hyperscaler is adding the most capacity. It centers on something far more fundamental: the cost of moving data.
In a recent Data Insights episode, I sat down with Solidigm’s Jeniece Wnorowski and Nilesh Shah, VP of Business Development at ZeroPoint Technologies, to work through where this friction lives in modern AI systems.
Nilesh began with an often-overlooked aspect of data storage: how much power it takes to move data. Moving a single bit of data from storage, through high-bandwidth memory (HBM) or low-power double data rate (LPDDR) memory, and into the on-chip static random access memory (SRAM) where computation actually happens costs roughly ten times more in power than performing the computation itself. That ratio explains why inference chip innovators like Groq, Cerebras, and SambaNova are focusing on data movement and memory hierarchies rather than raw compute.
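To make that ratio concrete, here is a minimal back-of-envelope sketch in Python. The picojoule figures are illustrative assumptions on my part, roughly in the range commonly cited for modern process nodes, not numbers from the episode.

```python
# Back-of-envelope comparison of data-movement energy vs. compute energy.
# The picojoule values are illustrative assumptions, not measurements.

PJ_PER_FLOP = 1.0        # assumed energy for one on-chip floating-point operation
PJ_PER_BYTE_DRAM = 20.0  # assumed energy to bring one byte in from DRAM/HBM

def movement_vs_compute(bytes_moved: float, flops: float) -> float:
    """Return the ratio of data-movement energy to compute energy."""
    movement_pj = bytes_moved * PJ_PER_BYTE_DRAM
    compute_pj = flops * PJ_PER_FLOP
    return movement_pj / compute_pj

# A workload that fetches one byte from DRAM for every two floating-point ops
# already spends ~10x more energy moving data than computing on it:
print(f"movement/compute energy ratio: {movement_vs_compute(1, 2):.0f}x")
```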
ZeroPoint Technologies was founded on the premise that the need for data and memory is going to increase rapidly, and that one of the ways to tackle that challenge is lossless memory compression. By reducing the volume of data physically moving across the system, you increase the effective bandwidth and capacity available to the compute engine.
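ZeroPoint implements this in hardware with its own compression scheme, which the episode doesn’t detail. Purely as an illustration of the principle, the sketch below uses off-the-shelf zlib to show how a lossless compression ratio translates into effective bandwidth; the link speed is an assumed figure.

```python
import zlib

import numpy as np

# Illustration only: zlib stands in for a hardware compressor to show how a
# lossless compression ratio multiplies the effective bandwidth of a link.

rng = np.random.default_rng(0)
# Emulate compressible data (narrow value range, repeated patterns) rather
# than incompressible full-range random bytes.
data = rng.integers(0, 16, size=1_000_000, dtype=np.uint8).tobytes()

compressed = zlib.compress(data, level=6)
ratio = len(data) / len(compressed)

raw_link_gbs = 100.0  # assumed raw link bandwidth, GB/s
print(f"compression ratio: {ratio:.2f}x")
print(f"effective bandwidth: ~{raw_link_gbs * ratio:.0f} GB/s over a {raw_link_gbs:.0f} GB/s link")
```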
When I asked whether AI workflows are being constructed correctly for managing data, and how that might change as enterprises scale inference into different parts of their business, Nilesh pointed to agentic AI entering the workflow as the key problem to solve.
A pattern seen at recent tech conferences is chip designers integrating multiple specialized AI agents into a single electronic design automation (EDA) workflow, each handling a distinct task such as error detection or chip verification. That means domain-specific inference solutions even for EDA operations, which fundamentally changes how enterprises will need to think about data.
As data volumes grow, memory bandwidth becomes the bottleneck. Nilesh pointed out that inference in agentic workflows takes place in two stages: prefill and decode. The prefill stage processes the input prompt and is genuinely compute intensive; modern GPU clusters handle it reasonably well. The decode stage, where the output is generated, is extremely memory intensive and is what really limits tokens per second.
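To see why decode hits the memory wall first, a rough sketch helps: each generated token has to stream essentially the full set of model weights out of memory, while prefill amortizes that traffic across every token in the prompt. The model size and bandwidth figures below are my own assumptions for illustration, not numbers from the conversation.

```python
# Rough illustration of the prefill/decode asymmetry. All figures are assumptions.

model_params = 70e9        # assumed 70B-parameter model
bytes_per_param = 2        # fp16/bf16 weights
hbm_bandwidth_gbs = 3000   # assumed memory bandwidth of one accelerator, GB/s

# Decode: every new token re-reads (roughly) all the weights, so tokens/sec
# for a single stream is capped by memory bandwidth, not FLOPs.
bytes_per_decoded_token = model_params * bytes_per_param
decode_ceiling = (hbm_bandwidth_gbs * 1e9) / bytes_per_decoded_token
print(f"bandwidth-bound decode ceiling: ~{decode_ceiling:.0f} tokens/s per stream")

# Prefill: the whole prompt is processed in one batched pass, so the same
# weight traffic is amortized over every prompt token and compute stays busy.
prompt_tokens = 2048
amortized_bytes = bytes_per_decoded_token / prompt_tokens
print(f"weight traffic per prompt token during prefill: ~{amortized_bytes / 1e6:.0f} MB")
```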
When it comes to responsiveness at enterprise scale, say 100,000 employees interacting simultaneously across multiple streams of data, the decode phase becomes a real bottleneck. At NVIDIA GTC 2026, many of the keynotes revolved around developing heterogeneous architectures that can manage the decode phase more efficiently.
We talked about when quantum computing would enter the picture. “What is the ChatGPT moment for quantum computing? That’s the favorite question I like to ask,” said Nilesh. He predicted that it could make sense to attach quantum processing units to data centers to offload the kinds of compute that quantum hardware handles well. Banks are already deploying early quantum computers, and another use case could be creating more secure encryption protocols.
When I asked Nilesh what he sees on the horizon for memory and storage technology, he outlined three distinct directions where investment and innovation are converging.
The first is alternative memory technologies. Dynamic random-access memory (DRAM) is a decades-old architecture that hasn’t changed fundamentally, and its limitations are starting to bite at exactly the moment AI workloads are scaling fastest. The second is new interfaces that will transform how memory communicates with the compute engine.
The third is the most significant shift in perspective: the unit of infrastructure design is moving from the chip, to the server, to the rack, and now to the data center as a single coherent system. Organizations are thinking about AI infrastructure in terms of megawatts allocated to a data center, with memory, storage, and compute all traded off within that power budget.
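As a toy illustration of that data-center-as-the-unit mindset, the sketch below treats the site’s power budget as the fixed resource and shows how shifting power between compute and memory or storage changes what the facility can host. Every wattage figure is an assumption chosen only to show the trade-off.

```python
# Toy power-budget model for a data center treated as one system.
# All wattage figures are illustrative assumptions.

SITE_BUDGET_MW = 50.0
WATTS_PER_ACCELERATOR = 1200.0  # assumed per-accelerator power draw

def accelerators_in_budget(compute_share: float) -> int:
    """Accelerators that fit in the compute slice of the site's power budget."""
    compute_watts = SITE_BUDGET_MW * 1e6 * compute_share
    return int(compute_watts // WATTS_PER_ACCELERATOR)

# Shifting five points of the budget from compute toward memory and storage
# changes how many accelerators the site can host:
for compute_share in (0.60, 0.55):
    print(f"compute share {compute_share:.0%}: {accelerators_in_budget(compute_share):,} accelerators")
```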
The biggest misconception, he felt, is the assumption that scaling AI output will keep requiring a proportional increase in power. “I expect a breakthrough, that someone will come up with an entirely new style of physics that will break that linear assumption that to go from 100 LLMs to a million, or from a million users to 100 million, we’ll just multiply the megawatts of power,” he said.
My conversation with Nilesh clarified a change in direction I’ve noticed at many recent tech conferences. The 10x cost differential between moving data and computing on it is the reason the entire inference chip landscape looks the way it does. It’s a significant engineering constraint that companies like ZeroPoint are building directly against. The prefill-decode distinction matters because enterprises planning inference deployments at scale need to architect around the decode phase as a distinct bottleneck.
We’re excited to see what new innovations take place in the memory space, and whether, as Nilesh believes, someone will eventually find a way to scale AI without more compute meaning proportionally more power.