It’s No Coincidence that Wait-and-See Rhymes with Latency
Why Wait? Sometimes That’s the Only Choice
“All good things arrive unto them that wait - and don't die in the meantime.”
- Mark Twain
Memory is always a tricky thing. And we’re not just talking about trying to find the TV remote for five minutes only to finally discover that it’s in the fridge next to the five-pack. Don’t judge until you’ve done your 1.6 km in a certain pair of size 14s, thank you. Anyway, Your Humble Author (YHA) remembers many a discussion on system memory, cache topologies, drive characteristics, and all those other fun things that are associated with an engineer’s most-dreaded, four-letter “L” word: latency. During one particular discussion with a senior software VP at a large OS company, that VP opined, “All processor architectures wait equally fast.” So, so true.
Pretty much every one of those computer architectures arranges its logic units to operate on a series of data registers. We promise this is going somewhere… and we’re also simplifying a lot so everyone can grab a bone and pick. Data registers load from and store to memory. Some architectures will also allow direct memory operations. But which memory? This is where the waiting comes into play.
The following analogy has been changed to protect the innocent, but YHA thanks the smart people for the concept. Let’s say the desired memory is stored in the nearest cache (with typical latencies in the one nanosecond range these days). Let’s also say that acquisition is the equivalent of finding the TV remote again next to the three-pack. Easy. Processor architectures typically have a series of larger caches or on-chip memories these days, followed by DRAM on the motherboard. It could take a walk to the neighbor’s house for a quick wave while opening the fridge in the garage to accomplish that particular DRAM “fetch.”
If that particular data happens to be stored on some of the latest and best solid-state memory on a coherent bus, the equivalent would be a short drive to the local convenience store. Leave the remote at home, please. A slightly older SSD configuration – or a poor driver routine (processor people always blame the software) – could mean a slightly longer trip to the grocery store in town. The average AI cluster data transfer across the mesh would involve stopping for a sit-down dinner before heading home. That’s a lot of work for one individual piece of data, all while the particular system thread is waiting.
One more. The best network topologies in the datacenter these days usually guarantee a maximum of 5 milliseconds of latency from any server to your register. THE COMPANY where YHA used to work offered the promise of an 8-week sabbatical. Were that entire sabbatical devoted to the acquisition of one beverage unit, we’d be in the ballpark of the latency we’re discussing: scale a one-nanosecond cache hit up to a one-second remote grab, and 5 milliseconds stretches out to roughly eight weeks. That sucker better not be an over-hopped IPA. And what’s the system thread doing that whole time? To use the Yiddish: bupkis.
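For the quantitatively inclined, here’s a minimal back-of-the-envelope sketch of that scaling in Python. Only the roughly one-nanosecond cache hit and the 5-millisecond worst-case network fetch come from the discussion above; the intermediate latencies are illustrative assumptions, not measurements, and the whole ladder is scaled so that a 1 ns cache hit equals one second of remote-hunting time.

```python
# Back-of-the-envelope scaling for the analogy above.
# Assumption: a 1 ns near-cache hit maps to 1 second of human time.
# Only the ~1 ns cache hit and the 5 ms worst-case network fetch come from
# the text; the intermediate latencies are rough, illustrative guesses.

HUMAN_SECONDS_PER_NS = 1.0  # 1 ns of machine wait -> 1 s of human time

LATENCIES_NS = [
    ("Near cache hit (grab the remote)",      1),
    ("DRAM fetch (neighbor's garage fridge)", 100),
    ("Fast SSD on a coherent bus",            10_000),
    ("Older SSD or slow driver path",         100_000),
    ("Cross-mesh AI cluster transfer",        1_000_000),
    ("Worst-case datacenter network fetch",   5_000_000),
]

def human_time(ns: float) -> str:
    """Convert a machine latency to scaled human time in friendly units."""
    seconds = ns * HUMAN_SECONDS_PER_NS
    for unit, size in (("weeks", 604_800), ("days", 86_400),
                       ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} second(s)"

if __name__ == "__main__":
    for label, ns in LATENCIES_NS:
        print(f"{label:40s} {ns:>10,} ns  ->  {human_time(ns)}")
```

Run it and the 5 ms worst case lands at a little over eight weeks of scaled waiting, which is exactly why the thread spends the whole sabbatical doing bupkis.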
Clearly, those decades of work between hardware and software geeks on multi-threading, data ordering, data promotion, and appropriate memory and SSD sizing (keep going, it was decades) were all pretty vital. The modern datacenter rack topology is now largely being driven by collaborative, hive-mind organizations like the Open Compute Project (OCP). And even with all that effort, many of those cores are still essentially thirsting for that next drink of data. There’s a very cogent person a few TechArena articles away from this one who noted it can be very disappointing to pay $1B for some AI hardware only to see $500M of work happen. These are not inconsequential decisions.
So what to do? First, don’t panic (towel and exploding planet optional). There are plenty of resources out there that can help optimize whatever you’re building. Look for similar organizations in your region or technology circle and ask questions; many will be very open if they’re not competing with you, and some even if they are. Look to organizations like OCP as a guide for setting a proper configuration. Also, your software providers likely have a much better view of the configurations that best run their workloads, mostly because of those decades of work. Finally, look to cloud options, since your provider will then be taking the risks.
And keep the analogy going. Maybe in the flat world of data analytics we’ll collectively find out what data fetch has us all going to the moon and back. By the way, is anyone else thirsty? And where the heck is the remote?