Making sense of latency numbers
Introduction
While preparing for system design interviews a few months ago, I realized something that caught me off guard: even as an experienced software engineer, my intuition about performance wasn’t as strong as I thought.
I know the usual rules of thumb: cache is fast, disk is slow, network calls are expensive. I can recite the latency numbers every programmer is supposed to know, and I’ve used them in designs before. But under pressure, I noticed a gap: I knew the facts, yet I couldn’t feel the difference between nanoseconds, microseconds, and milliseconds in a way that consistently guided good design decisions.
Visualizing a nanosecond
What finally made things click wasn’t a benchmark or a whitepaper - it was a story about Grace Hopper, the Queen of Code:
They started talking about circuits that acted in nanoseconds. Billionths of a second. Well, I didn’t know what a billion was (I don’t think most of those men downtown knew what a billion is either). And uh… if you don’t know what a billion is, how on Earth do you know what a billionth is? I fussed and fumed. Finally one morning, in total desperation, I called over to the engineering building and I said: “Please cut off a nanosecond and send it over to me.” And I brought you some today.
Now, what I wanted when I asked for a nanosecond was: I wanted a piece of wire which would represent the maximum distance that electricity could travel in a billionth of a second. Now, of course, it wouldn’t really be through wire. It’d be out in space; the velocity of light. So, if you start with the velocity of light and use your friendly computer, you’ll discover that a nanosecond is 11.8 inches (30 cm / a foot) long (the maximum limiting distance that electricity can travel in a billionth of a second).
Finally, at the end of about a week, I called back and said: “I need something to compare this to. Could I please have a microsecond?”
And this is when things suddenly started to click for me. I guess because I’m a visual person, I needed a visual aid to understand these latency numbers once and for all.
So, let’s go with some general facts for the purpose of this article:
- A building is about 100 ft. tall
- A city is about 6–10 km in radius
- A state is about 300 km in radius
And lastly, remember this: bandwidth is different from latency. Bandwidth is the amount of data that can be transferred over a network in a given time period, while latency is the time it takes for a signal to travel from one point to another.
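A crude way to internalize that distinction: total transfer time is roughly a fixed latency cost plus a size-dependent bandwidth cost. A minimal sketch, with plausible RAM-like numbers I’m assuming for illustration:

```python
def transfer_time_s(size_mb: float, latency_s: float, bandwidth_mb_s: float) -> float:
    """Rough model: pay the latency once, then stream at full bandwidth."""
    return latency_s + size_mb / bandwidth_mb_s

# 4 kB read: the 100 ns latency dominates the ~40 ns of streaming.
small_read = transfer_time_s(0.004, latency_s=100e-9, bandwidth_mb_s=100_000)

# 1 GB read: latency is noise next to 10 ms of streaming.
large_read = transfer_time_s(1_000, latency_s=100e-9, bandwidth_mb_s=100_000)
```

Small reads are latency-bound; large reads are bandwidth-bound. That distinction runs through the rest of this article.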
Latency Numbers Everyone Should Know - Expressed again
Remember this order of performance of computer hardware:
CPU cache (L1 > L2 > L3) > Main memory (RAM) > SSD (NVMe > SATA) > HDD > Local network (10 Gbps LAN) > Internet
Nanosecond stuff - staying within the same building
- Within the same room:
- L1 cache reference, 1 ns, 1 ft.
- Branch misprediction, 3 ns, 3 ft., Move 1 ft. in wrong direction, move 1 ft. back to origin, move 1 ft. in the right direction
- L2 cache reference, 4 ns, 4 ft., 4x worse than L1
- Within the same house/apartment:
- Mutex lock / unlock, 16 ns, 16 ft. With a standard ceiling height of 8–9 ft., that’s about two floors. Also starting to grasp the concept of spatial locality now - cache means staying in the same room; beyond that, we start moving to other rooms or to floors above or below.
- Main memory reference, 100 ns, 100 ft., 100x worse than L1. About half the height of Leaning Tower of Pisa (200 ft.).
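Hopper’s wire generalizes nicely: at the speed of light, nanoseconds map almost one-to-one to feet. A quick check of the numbers above (constants are standard physical values, not from the article):

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum, m/s
FT_PER_M = 3.28084

def light_feet(ns: float) -> float:
    """Distance light travels in `ns` nanoseconds, in feet."""
    return C_M_PER_S * ns * 1e-9 * FT_PER_M

l1_ft = light_feet(1)     # ~0.98 ft: Hopper's 11.8-inch wire
ram_ft = light_feet(100)  # ~98 ft: a main memory reference, about one building
```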
Now, let’s move out of the building.
Microsecond stuff - staying within the same city
- Compress 1 kB with Zippy (Google’s internal name for Snappy), 2 µs, 2,000 ft. About 10 Leaning Towers stacked on top of each other, not that they would ever be stable. And that’s a lot of stacking.
- 4 kB random read from SSD, 20 µs, 20,000 ft. SSD is about 200x slower than RAM for random access.
Millisecond stuff - crossing state lines
- Round trip within same datacenter, 0.5 ms, ~150 km - still within a state’s radius
- Disk seek (a random access on HDD), 10 ms, ~3,000 km - crossing many states. HDD is 500x slower than SSD for random access. Remember, this is a random access, not a sequential one, so it is latency bound, not bandwidth bound.
- TCP packet round trip between continents, 150 ms, ~45,000 km - more than the circumference of the Earth. Light takes ~133 ms to cover the circumference of the Earth, which is about 40,000 km.
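That light-speed figure is easy to sanity-check (the constants below are standard values I’m supplying, not from the article):

```python
C_KM_PER_S = 299_792           # speed of light in vacuum, km/s
EARTH_CIRCUMFERENCE_KM = 40_075

light_ms = EARTH_CIRCUMFERENCE_KM / C_KM_PER_S * 1000  # ~134 ms
# Real intercontinental TCP round trips (~150 ms) are slower than this
# ideal: light in fiber travels at roughly 2/3 c, and packets take
# routing detours and sit in queues along the way.
```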
And this marks the end of latency numbers programmers should know about.
Sequential access
Let’s look at sequential access now.
Sequential access moves large, contiguous chunks of data. Once the initial access latency is paid, the system streams data at full speed, maximizing data transfer rates. This is common in streaming, video processing, or sequential file reads.
Let’s see how this works in modern computers.
The DMA revolution: How we stopped babysitting data transfers
In the early days of computing, the CPU had to do everything. Want to read a file from disk? The CPU would:
- Send a command to the disk
- Wait for the data
- Copy each byte from the disk controller to memory
- Repeat until done
This is called Programmed I/O (PIO), and it’s exactly as inefficient as it sounds. The CPU - your expensive, fast processor - spent most of its time acting as a glorified data mover, copying bytes one at a time while actual computation sat waiting.
Without DMA (Programmed I/O):
sequenceDiagram
participant CPU
participant Memory
participant Disk
CPU->>Disk: 1. Send read command
Disk->>Disk: 2. Seek to data
loop For each byte/word
Disk->>CPU: 3. Transfer byte to CPU register
CPU->>Memory: 4. Write byte to memory
end
CPU->>CPU: 5. Finally free to do actual work
See the problem? The CPU is stuck in that loop, unable to do anything else. For a 1 MB file with word-by-word transfers, that’s hundreds of thousands of interruptions.
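To put a number on “hundreds of thousands of interruptions”, assume the CPU moves one 4-byte word per loop iteration (a common PIO transfer size - my assumption for this back-of-envelope calculation):

```python
FILE_BYTES = 1 * 1024 * 1024  # a 1 MB file
WORD_BYTES = 4                # one CPU register's worth per loop iteration

pio_transfers = FILE_BYTES // WORD_BYTES  # 262,144 CPU-driven copies
dma_interrupts = 1                        # with DMA: one completion interrupt
```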
Then came Direct Memory Access (DMA) in the 1960s. The idea was simple but revolutionary: what if we had a dedicated controller that could move data directly between devices and memory, without bothering the CPU for every byte?
With DMA:
sequenceDiagram
participant CPU
participant DMA as DMA Controller
participant Memory
participant Disk
CPU->>DMA: 1. "Read 1MB from disk to address 0x1000"
CPU->>CPU: 2. Goes off to do useful work
DMA->>Disk: 3. Send read command
Disk->>Disk: 4. Seek to data
Disk->>DMA: 5. Stream data
DMA->>Memory: 6. Write directly to RAM
DMA->>CPU: 7. Interrupt: "Transfer complete!"
The CPU kicks off the transfer with a single instruction, then walks away. The DMA controller handles the tedious byte-shuffling while the CPU does actual computation. When the transfer completes, the DMA controller taps the CPU on the shoulder with an interrupt.
Why this matters for sequential access:
DMA is why sequential reads can saturate your storage bandwidth. Without it, your blazing-fast NVMe SSD would be bottlenecked by how quickly your CPU can copy bytes - which, despite being fast, is nowhere near 3500 MB/s when it has other things to do.
Modern systems have evolved this further:
- Bus mastering: Devices can become temporary “masters” of the system bus
- Scatter-gather DMA: Transfer non-contiguous memory regions in one operation
- RDMA (Remote DMA): DMA across network connections, bypassing the remote CPU entirely
The takeaway? Sequential access is fast because the CPU gets out of the way. It sets up the transfer and lets dedicated hardware do the heavy lifting.
Looking at the above, one can see that sequential access is primarily bandwidth-bound: its performance is limited by the maximum rate at which the underlying storage medium (HDD, SSD, or RAM) can stream data continuously, usually capped by the physical interface or controller rather than by the time taken to locate the first byte (latency). Because data is read linearly, processors can prefetch, allowing the system to utilize the full memory bandwidth.
So, now the next section can more appropriately be titled Bandwidth numbers programmers should know.
Luckily for us, it’s quite easy to grasp bandwidth numbers. They’re advertised everywhere in computing; you just need to know how to make sense of them.
Again, here’s the ordering you can use: RAM > SSD > HDD > Network.
Sequential read from main memory (10 µs)
Take a DDR5-6400 dual-channel RAM. In RAM terminology:
- The “6400” is written as 6400 MT/s, meaning 6400 million transfers per second. Each transfer is 8 bytes per channel, so the bandwidth is 6400 million transfers/s * 8 bytes = 51200 MB/s per channel.
- We got 2 channels, so the total bandwidth is 51200 MB/s * 2 = 102400 MB/s.
So, if you can read at 102400 MB/s sequentially from memory, then reading 1 MB sequentially takes 1 MB / 102400 MB/s ≈ 9.8 µs ≈ 10 µs.
Well, not quite accurate, but close enough for most practical purposes. RAM does incur some initial latency to access the first byte - typically around 100 ns (remember the main memory reference above?). Once the first byte is accessed, subsequent bytes stream at the maximum bandwidth, and because 100 ns is so small relative to 10 µs, it’s often ignored in practice.
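The same arithmetic in code, so you can swap in your own RAM’s numbers (DDR5-6400, dual channel, as above):

```python
TRANSFERS_PER_S = 6_400_000_000  # DDR5-6400: 6400 million transfers/s
BYTES_PER_TRANSFER = 8           # 64-bit wide channel
CHANNELS = 2

bandwidth_mb_s = TRANSFERS_PER_S * BYTES_PER_TRANSFER * CHANNELS / 1_000_000
# -> 102,400 MB/s total

read_1mb_us = 1 / bandwidth_mb_s * 1_000_000  # ~9.8 µs to stream 1 MB
```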
Sequential read from SSD (0.2 - 1 ms)
I won’t kid here: SSDs are a complicated piece of hardware. For a typical consumer NVMe SSD, sequential bandwidth is limited by:
- Host interface (PCIe lanes) and Controller: PCIe 3.0 x4
- NAND flash parallelism
Advertised sequential read: up to 3500 MB/s
Let’s do our math: Time = Data / Bandwidth = 1 MB / 3500 MB/s ≈ 0.3 ms
This doesn’t tell the whole story, though. SSDs have other things going on within them:
- NVMe SSDs have a controller that manages the NAND flash memory and handles the data transfer.
- SSDs have a cache that can store frequently accessed data, reducing the number of reads from the NAND flash memory.
Even for sequential reads, SSDs pay:
- Command submission latency
- Controller scheduling
- NAND page access
- DMA setup
Typical NVMe first-read latency is around 80–150 µs; let’s call it 100 µs.
Adding that to our streaming time brings the total to 0.1 ms + 0.3 ms ≈ 0.4 ms, comfortably within the 0.2–1 ms range.
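Putting the two components together (the ~100 µs first-access figure is my rough estimate from typical NVMe specs, not a measured value):

```python
SSD_BANDWIDTH_MB_S = 3500  # advertised sequential read, PCIe 3.0 x4 NVMe
FIRST_ACCESS_MS = 0.1      # ~100 µs of controller + NAND latency (rough)

stream_1mb_ms = 1 / SSD_BANDWIDTH_MB_S * 1000   # ~0.29 ms to stream 1 MB
total_1mb_ms = FIRST_ACCESS_MS + stream_1mb_ms  # ~0.39 ms end to end
```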
Conceptually, SSDs are roughly 30x slower than RAM for sequential access (102,400 MB/s vs 3,500 MB/s).
RAM is a firehose with almost no startup cost - unless you want to build a supercomputer. SSDs are a fast conveyor belt with a big “spin-up” delay.
Sequential read from disk (5 ms)
Take a typical 7200 RPM HDD like the Seagate Barracuda. HDDs advertise sequential read speeds of around 150–200 MB/s. Let’s use 200 MB/s.
Time = Data / Bandwidth = 1 MB / 200 MB/s = 5 ms
This assumes the read head is already positioned at the right track. For a fresh random seek, you’d first pay the seek time (~5 ms) and rotational latency (~4 ms for a 7200 RPM drive), but once the head is in place, data streams off the spinning platter at full bandwidth. That’s why sequential reads on HDD are surprisingly decent — it’s the seeking that kills you.
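The cold-start penalty is where HDDs really hurt; a sketch with typical 7200 RPM figures (the 5 ms average seek is an assumed typical value):

```python
HDD_BANDWIDTH_MB_S = 200
SEEK_MS = 5                      # typical average seek time
ROTATION_MS = 60_000 / 7200 / 2  # half a rotation at 7200 RPM: ~4.17 ms

stream_1mb_ms = 1 / HDD_BANDWIDTH_MB_S * 1000        # 5 ms, head already in place
cold_1mb_ms = SEEK_MS + ROTATION_MS + stream_1mb_ms  # ~14 ms after a fresh seek
```

Nearly two-thirds of a cold 1 MB read is spent just getting the head into position.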
Conceptually, HDDs are roughly 15x slower than SSDs and 500x slower than RAM for sequential access. The spinning platter is the bottleneck here - data can only be read as fast as the disk rotates under the head.
Sequential read from 1 Gbps network (10 ms)
Take your standard office or data center 1 Gbps Ethernet connection. The key here is bits vs bytes:
1 Gbps = 1,000 Mbps = 125 MB/s (divide by 8 to go from bits to bytes)
Time = Data / Bandwidth = 1 MB / 125 MB/s = 8 ms
But you never get the full 125 MB/s in practice. TCP/IP headers, Ethernet framing, and protocol overhead eat into the raw bandwidth, giving you an effective throughput of roughly 100–110 MB/s. That brings us closer to ~10 ms.
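The bits-to-bytes conversion plus an overhead haircut, in code (the ~15% overhead factor is my rough assumption, consistent with the 100–110 MB/s effective throughput above):

```python
LINK_GBPS = 1
raw_mb_s = LINK_GBPS * 1000 / 8   # 125 MB/s: 8 bits per byte
effective_mb_s = raw_mb_s * 0.85  # TCP/IP + Ethernet framing overhead, roughly

xfer_1mb_ms = 1 / effective_mb_s * 1000  # ~9.4 ms for 1 MB
```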
Conceptually, network is 2x slower than HDD and 1000x slower than RAM for sequential access. And this is on a local network — over the internet, you’re sharing bandwidth with congestion, routing hops, and your ISP’s promises.