WEKA: Make Tokens Flow with Memory-Like Storage

October 28, 2025

As enterprises move artificial intelligence (AI)-based solutions further into production, inference speed is becoming a key factor in whether deployments succeed or fail. Real business value, and real infrastructure challenges, lie in how quickly models can generate responses for end users.

In a recent TechArena Data Insights episode, I spoke with Val Bercovici, chief AI officer at WEKA, and Scott Shadley, director of thought leadership at Solidigm, to explore how inference workloads are exposing infrastructure bottlenecks that threaten AI economics. Their conversation revealed why a metric called time to first token has become essential for measuring inference performance, and how storage architecture designed for this phase of AI can transform both productivity and profitability.

The New Currency of AI Performance

In a relatively short time, one metric has emerged to measure AI responsiveness: time to first token. As Val and Scott explained, it measures how long a model takes to begin responding, the gap between a user submitting a prompt and the first token of output arriving. It has become a key metric because it translates directly into business value. “Time to first token literally translates to revenue, OPEX, and gross margin for the inference providers,” Val said.
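
To make the metric concrete, here is a minimal Python sketch of how time to first token can be measured against a streaming inference endpoint. The stream_tokens generator is a hypothetical stand-in for whatever streaming client a provider exposes; only the timing pattern matters.

```python
import time
from typing import Iterator


def stream_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming client: yields tokens as they arrive."""
    for token in ["Bonjour", " tout", " le", " monde", "."]:
        time.sleep(0.05)  # simulated network + generation latency per token
        yield token


def time_to_first_token(prompt: str) -> float:
    """Seconds between submitting the prompt and receiving the first token."""
    start = time.perf_counter()
    for _ in stream_tokens(prompt):
        return time.perf_counter() - start
    raise RuntimeError("stream produced no tokens")


if __name__ == "__main__":
    ttft = time_to_first_token("Translate 'hello everyone' into French")
    print(f"time to first token: {ttft * 1000:.1f} ms")
```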

As a concrete example, Val cited real-time voice translation, where instantaneous responses are critical to natural conversation. “Who wants to wait an awkward, pregnant pause of 30, 40 seconds for a translation?” Val asked. “We want that to be real-time and instantaneous, and time to first token is a key metric for that kind of use case.”

The metric matters because it reflects deeper infrastructure realities for AI inference workloads. Behind every response to a prompt lies a complex process that can be split into two phases. In the pre-fill phase, the prompt is converted into tokens and expanded into the key-value (KV) cache, essentially the working memory of the large language model. In the decode phase, the model generates the actual output users see. Graphics processing units (GPUs) are currently asked to do both at once, which Val called “a very expensive kind of context switching,” and the decode phase makes extreme demands on memory as well.
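
As a rough, self-contained illustration of those two phases (not any model’s actual implementation), the toy sketch below builds a KV cache in a pre-fill step, then generates tokens one at a time in a decode loop that rereads the whole cache on every step, which is why decode leans so heavily on memory bandwidth. All sizes and the “sampling” rule are made up.

```python
import numpy as np

D_MODEL = 64  # toy hidden size


def prefill(prompt_tokens: list[int]) -> list[tuple[np.ndarray, np.ndarray]]:
    """Pre-fill phase: process the whole prompt once and build the KV cache.

    Each prompt token contributes a (key, value) pair; the cache acts as the
    model's working memory for the rest of the request.
    """
    rng = np.random.default_rng(0)
    return [(rng.standard_normal(D_MODEL), rng.standard_normal(D_MODEL))
            for _ in prompt_tokens]


def decode(kv_cache: list, max_new_tokens: int = 4) -> list[int]:
    """Decode phase: generate one token at a time, attending over the cache.

    Every step reads the entire KV cache, which is why this phase is dominated
    by memory bandwidth rather than raw compute.
    """
    rng = np.random.default_rng(1)
    generated = []
    for _ in range(max_new_tokens):
        query = rng.standard_normal(D_MODEL)
        scores = np.array([query @ k for k, _ in kv_cache])  # attention scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        context = sum(w * v for w, (_, v) in zip(weights, kv_cache))
        next_token = int(abs(context.sum()) * 1000) % 50_000  # toy "sampling"
        generated.append(next_token)
        # The new token's key/value pair also joins the cache, so it keeps growing.
        kv_cache.append((rng.standard_normal(D_MODEL), rng.standard_normal(D_MODEL)))
    return generated


prompt = [101, 2023, 2003, 1037, 3231, 102]  # pretend token IDs
cache = prefill(prompt)                      # phase 1: build the KV cache
print(decode(cache))                         # phase 2: generate new tokens
```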

Rethinking Storage Architecture for AI Workloads

The conversation revealed how AI workloads differ fundamentally from traditional computing. A modern GPU contains more than 17,000 cores, compared with roughly one hundred in a CPU, creating entirely different performance requirements. This architectural shift demands a fresh approach to storage design, one that treats solid-state drives (SSDs) not merely as storage devices but as memory extensions.

WEKA’s NeuralMesh Axon technology demonstrates this evolution. The solution embeds storage intelligence close to the GPU and creates software-defined memory from NVMe devices, so inference servers see NVMe storage as memory and get memory-level performance from SSD hardware. This approach addresses one of inference computing’s most significant challenges: providing sufficient memory bandwidth to feed GPU cores without incurring prohibitive costs.
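
WEKA has not published implementation details here, so purely to illustrate the general idea of addressing SSD capacity like memory, the sketch below memory-maps a file that could sit on an NVMe-backed filesystem; application code then reads and writes it with byte-slice semantics rather than explicit I/O calls. The path and size are hypothetical.

```python
import mmap
import os
import tempfile

# Hypothetical backing file; in practice this would live on an NVMe-backed filesystem.
PATH = os.path.join(tempfile.gettempdir(), "kv_spill.bin")
SIZE = 64 * 1024 * 1024  # 64 MiB of "extended memory" for this demo

# Create a file of the desired size to back the mapping.
with open(PATH, "wb") as f:
    f.truncate(SIZE)

# Map it into the process address space: the SSD-backed region is now
# addressed like RAM (byte slices), not through explicit read()/write() calls.
with open(PATH, "r+b") as f, mmap.mmap(f.fileno(), SIZE) as mem:
    mem[0:5] = b"hello"   # store bytes as if writing to an in-memory buffer
    print(mem[0:5])       # read them back the same way

os.remove(PATH)
```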

The Assembly Line Problem

One of the discussion’s most striking revelations centered on what Val termed the “assembly line” problem. While data centers optimized for AI are often described as “AI factories,” AI inference today operates more like a job shop than an assembly line. Data movement remains inefficient, forcing expensive re-prefill operations that burn kilowatts of power each time.
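
One way to picture the cost of re-prefilling: if the KV cache built for a long shared context survives between requests, the expensive pre-fill step can be skipped the second time. The sketch below is only a schematic of that idea, with a sleep standing in for GPU work; it is not how any particular inference stack implements KV-cache reuse.

```python
import hashlib
import time

kv_store: dict[str, list[str]] = {}    # stand-in for a tiered KV-cache store


def prefill(prompt: str) -> list[str]:
    """Simulated pre-fill: expensive, so worth paying for only once per context."""
    time.sleep(0.5)                     # stands in for seconds of GPU work and power
    return prompt.split()               # pretend this is the KV cache


def get_kv_cache(prompt: str) -> list[str]:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in kv_store:             # cache miss: pay the pre-fill cost
        kv_store[key] = prefill(prompt)
    return kv_store[key]                # cache hit: reuse, no re-prefill


long_context = "system prompt plus a large shared document " * 200
for attempt in range(2):
    start = time.perf_counter()
    get_kv_cache(long_context)
    print(f"request {attempt + 1}: {time.perf_counter() - start:.3f} s")
```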

This inefficiency manifests in real-world constraints that AI users encounter daily. The rate limits imposed by AI service providers reflect the genuine economics of token generation. Coding agents and research tools that consume 100 to 10,000 times more tokens than simple chat sessions can’t be served profitably at current infrastructure costs, forcing providers to limit access even to customers willing to pay premium prices.
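
A back-of-the-envelope calculation shows why those multipliers hurt. The per-session token count and per-token price below are assumptions chosen only for illustration; the 100x and 10,000x multipliers are the range cited in the conversation.

```python
# Illustrative numbers only: the chat baseline and price are assumptions;
# the multipliers are the range cited in the conversation.
chat_tokens_per_session = 2_000
price_per_million_tokens = 5.00   # hypothetical blended $ per 1M tokens

for label, multiplier in [("chat session", 1),
                          ("coding agent", 100),
                          ("deep research run", 10_000)]:
    tokens = chat_tokens_per_session * multiplier
    cost = tokens / 1_000_000 * price_per_million_tokens
    print(f"{label:18s} {tokens:>12,} tokens  ≈ ${cost:,.2f}")
```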

The Path Forward for Enterprises

Scott and Val offered practical guidance for IT leaders navigating the transition from proof-of-concept projects to production AI deployments. Scott stressed the importance of aligning hardware and software planning, noting that AI infrastructure demands closer collaboration between traditionally siloed teams. Val encouraged leaders to approach AI infrastructure with fresh perspectives, setting aside assumptions from previous technology generations.

The TechArena Take

As AI moves from experimental projects to production workloads generating measurable business value, infrastructure choices increasingly determine competitive advantage. Organizations that optimize storage architecture for token economics position themselves to scale AI profitably, while those applying traditional storage approaches risk creating bottlenecks that limit innovation. The enterprises that act decisively today in implementing high-performance storage architectures designed specifically for AI workloads will find themselves better positioned to capitalize on AI’s transformative potential.

For more information on WEKA’s AI infrastructure solutions, visit WEKA.io. Learn about Solidigm’s AI-optimized storage innovations at solidigm.com/ai.
