One of the most confusing driver bugs is “the data is clearly in memory, but the other side cannot see it.”
The CPU prepares a transmit buffer, but the network device sends old data. The DMA engine has written received data into memory, but the driver or application still reads stale values. Adding logs makes the issue disappear; changing the optimization level brings it back.
These bugs usually do not mean that DMA failed or that a pointer is wrong. More often, visibility between the CPU cache, the DMA device, and memory was not handled correctly.
The safest first model is this: the CPU usually accesses memory through caches, while a DMA device may access memory directly. The data seen by the CPU cache and the data seen by the device in physical memory may not be the same version at a given moment.
CPU <-> Cache <-> Memory
DMA Device -------> Memory
The CPU and the device do not automatically share a single, always-current view of memory.
What DMA Solves
DMA lets a device read from or write to memory without the CPU moving every byte.
Network receive, audio capture, camera frames, disk I/O, and high-speed SPI transfers may all use DMA.
Without DMA, the CPU may have to repeat a loop like this for every byte:
read one byte from device register
write it to memory
read next byte
write it to memory
With DMA, the CPU mainly prepares descriptors, buffers, and control registers, then the device moves data:
CPU configures DMA address and length
-> device reads from memory or writes to memory
-> device interrupts on completion
-> CPU handles the result
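The flow above can be sketched in plain C. This is a toy model, not a real controller interface: the `dma_desc` layout, `fake_device_run`, and `cpu_start_transfer` are hypothetical names, and the "device" is an ordinary function that copies memory and sets a completion flag where real hardware would raise an interrupt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout; real controllers define their own. */
struct dma_desc {
    const uint8_t *src;      /* source address (a plain pointer in this toy) */
    uint8_t *dst;            /* destination address */
    uint32_t len;            /* number of bytes to move */
    volatile uint32_t done;  /* completion flag set by the "device" */
};

/* Stand-in for the hardware: moves the data, then signals completion. */
static void fake_device_run(struct dma_desc *d)
{
    memcpy(d->dst, d->src, d->len);
    d->done = 1;             /* a real device would raise an interrupt here */
}

/* CPU side: configure address and length, start, then check the result. */
static int cpu_start_transfer(struct dma_desc *d, const uint8_t *src,
                              uint8_t *dst, uint32_t len)
{
    assert(src != NULL && dst != NULL);
    d->src = src;
    d->dst = dst;
    d->len = len;
    d->done = 0;
    fake_device_run(d);      /* stands in for writing a "start" register */
    return d->done ? 0 : -1;
}
```

The CPU never touches the payload bytes itself; it only fills in the descriptor and reacts to completion.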
DMA improves throughput and reduces CPU load. But it introduces a new problem: both device and CPU access memory, and they may not see the same version.
Why CPU Cache Creates an Illusion
The CPU usually does not access DRAM directly every time. It keeps frequently used data in cache.
That creates two common problems.
First, the CPU wrote data, but it has not reached memory yet.
CPU writes transmit buffer
-> new data stays in cache
-> physical memory still has old data
-> DMA device reads old data from memory
Second, DMA wrote memory, but the CPU still reads an old cached copy.
DMA writes receive buffer
-> physical memory has new data
-> CPU cache still has old data
-> CPU reads stale value
So “same memory address” does not mean both sides see the same content immediately. Cache makes the CPU faster, but it turns visibility into an explicit driver responsibility.
Why Transmit Direction Needs Clean
In transmit direction, the CPU prepares data and the device reads it from memory through DMA.
The risk is that CPU changes are still dirty in cache and have not been written back to memory that the device can see.
Before handing the buffer to the DMA device, the driver usually needs a cache clean, which writes dirty cache lines back to memory.
CPU writes tx buffer
-> clean cache
-> memory has latest data
-> device DMA reads memory
If this is missing, the device may read an old packet, old audio sample, old command, or half-updated data.
This class of bug is subtle. Sometimes the cache line happens to be written back naturally, so the issue disappears. Logs, delays, and compiler options can all change reproduction.
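The transmit case can be imitated with a toy model of a single write-back cache line. Everything here is an assumption made for illustration: the 64-byte `LINE_SIZE`, the `tx_line` struct, and the `cpu_write`, `cache_clean`, and `device_read` helpers are hypothetical and do not correspond to any real cache-maintenance API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64  /* assumed cache-line size for this toy model */

/* One line of memory plus the CPU's cached copy of it. */
struct tx_line {
    uint8_t memory[LINE_SIZE];  /* what a DMA device reads */
    uint8_t cache[LINE_SIZE];   /* what the CPU reads and writes */
    int dirty;                  /* cache holds data not yet in memory */
};

/* CPU store: lands in the cache only (write-back behaviour). */
static void cpu_write(struct tx_line *l, size_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    l->cache[off] = v;
    l->dirty = 1;
}

/* Cache clean: write the dirty line back to memory. */
static void cache_clean(struct tx_line *l)
{
    if (l->dirty) {
        memcpy(l->memory, l->cache, LINE_SIZE);
        l->dirty = 0;
    }
}

/* Device DMA read: bypasses the cache and sees memory only. */
static uint8_t device_read(const struct tx_line *l, size_t off)
{
    assert(off < LINE_SIZE);
    return l->memory[off];
}
```

In this model, a `device_read` issued after `cpu_write` but before `cache_clean` returns the old byte, which is the stale-packet symptom described above.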
Why Receive Direction Needs Invalidate
In receive direction, the device writes memory through DMA, and the CPU reads the result.
The risk is that the device has written new data to memory, but the CPU still has an old cached copy.
Before the CPU reads DMA-written data, the driver usually needs a cache invalidate: discard the corresponding cache lines so that the next CPU read reloads them from memory.
device DMA writes rx buffer
-> invalidate cache
-> CPU reloads from memory
-> CPU sees latest data
If this is missing, the driver may keep reading old packet headers, old status, old samples, or conclude that the device did not write anything.
Ordering matters too. The CPU should read only after DMA completion is confirmed. Otherwise it may read a partially written state.
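The receive case can be imitated the same way. This is again a toy model under stated assumptions: `rx_line`, `device_write`, `cpu_read`, and `cache_invalidate` are hypothetical names, and the line size of 64 bytes is chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64  /* assumed cache-line size for this toy model */

/* One line of memory plus the CPU's cached copy of it. */
struct rx_line {
    uint8_t memory[LINE_SIZE];  /* what the DMA device writes */
    uint8_t cache[LINE_SIZE];   /* what the CPU reads */
    int valid;                  /* cache currently holds a copy of the line */
};

/* Device DMA write: goes straight to memory, the cache is not updated. */
static void device_write(struct rx_line *l, size_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    l->memory[off] = v;
}

/* CPU load: fills the line on a miss, otherwise returns the cached copy. */
static uint8_t cpu_read(struct rx_line *l, size_t off)
{
    assert(off < LINE_SIZE);
    if (!l->valid) {
        memcpy(l->cache, l->memory, LINE_SIZE);
        l->valid = 1;
    }
    return l->cache[off];
}

/* Cache invalidate: drop the cached copy so the next load refetches. */
static void cache_invalidate(struct rx_line *l)
{
    l->valid = 0;
}
```

If the CPU touched the buffer before the transfer, the line is already cached, so `cpu_read` keeps returning the old byte until `cache_invalidate` is called after the device has finished writing.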
Why Cache-Line Alignment Matters
Cache maintenance usually works on cache lines, not arbitrary byte ranges.
If a DMA buffer is not cache-line aligned, or if unrelated objects share the same cache line, extra risks appear.
For example, one cache line may contain:
[ordinary variable][part of DMA receive buffer]
If the driver invalidates that cache line to read DMA data, it may also discard a dirty ordinary variable that the CPU has not written back.
Conversely, cleaning a cache line that contains both a DMA buffer and other data writes back the whole line, so stale cached bytes can overwrite data the device has just written to memory.
DMA buffers usually need:
- address aligned to cache-line or platform requirements
- length handled at cache-line granularity
- no unrelated ordinary variables sharing the same cache line
- no arbitrary CPU reuse during DMA lifetime
These details look tedious, but they decide whether cache maintenance is safe.
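The alignment rules above can be expressed as a few small helpers. This is a minimal sketch assuming a 64-byte cache line; `CACHE_LINE`, `line_down`, `line_up`, `dma_range_is_line_exact`, and `dma_buf_alloc` are hypothetical names, and a real driver should take the line size and allocator from its platform rather than hard-coding them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u  /* assumption; query the real value on your platform */

/* Round an address down to the start of its cache line. */
static uintptr_t line_down(uintptr_t a)
{
    return a & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Round an address or length up to the next cache-line boundary. */
static uintptr_t line_up(uintptr_t a)
{
    return (a + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
}

/* True when [addr, addr + len) starts and ends on line boundaries,
 * so no unrelated data can share a line with the buffer. */
static int dma_range_is_line_exact(uintptr_t addr, size_t len)
{
    return len != 0 && addr % CACHE_LINE == 0 && len % CACHE_LINE == 0;
}

/* Allocate a buffer that owns whole cache lines (C11 aligned_alloc). */
static void *dma_buf_alloc(size_t len)
{
    assert(len > 0);
    size_t padded = (size_t)line_up(len);  /* size must be a multiple of alignment */
    return aligned_alloc(CACHE_LINE, padded);
}
```

A 100-byte request is padded to 128 bytes here, so cleaning or invalidating the buffer never touches a line that holds anything else. Memory returned by `dma_buf_alloc` is released with `free`.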
A DMA Address Is Not Always a CPU Pointer
Another common driver mistake is treating a CPU pointer as the device DMA address.
On systems with an MMU, IOMMU, or complex bus fabric, user virtual addresses, kernel virtual addresses, physical addresses, and DMA addresses may all be different.
User virtual address: pointer seen by application
Kernel virtual address: address used by kernel code
Physical address: location in memory
DMA address: bus address seen by device or IOMMU
A device usually does not understand user-space virtual addresses. Even if the kernel can access a buffer, the device may not be able to access it using the same numeric value.
Drivers usually use DMA mapping APIs to turn memory into a device-usable DMA address and unmap it later.
That step may handle address translation, cache synchronization, and IOMMU permissions.
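To make the difference between a CPU pointer and a DMA address concrete, here is a toy translation table in plain C. It is not a real IOMMU or kernel API: `toy_dma_addr_t`, `toy_iommu`, `toy_map`, `toy_translate`, `toy_unmap`, and the address constants are all invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_dma_addr_t;     /* hypothetical bus-address type */

#define TOY_IOVA_BASE 0x80000000ull  /* made-up start of the device address space */
#define TOY_SLOTS     4
#define TOY_SLOT_SIZE 0x1000ull

struct toy_iommu {
    void  *cpu_ptr[TOY_SLOTS];       /* NULL means the slot is unmapped */
    size_t len[TOY_SLOTS];
};

/* Map a CPU buffer; returns the address the device must use, or 0 on failure. */
static toy_dma_addr_t toy_map(struct toy_iommu *m, void *cpu_ptr, size_t len)
{
    assert(cpu_ptr != NULL);
    if (len == 0 || len > TOY_SLOT_SIZE)
        return 0;
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (m->cpu_ptr[i] == NULL) {
            m->cpu_ptr[i] = cpu_ptr;
            m->len[i] = len;
            return TOY_IOVA_BASE + (toy_dma_addr_t)i * TOY_SLOT_SIZE;
        }
    }
    return 0;
}

/* Device-side lookup: only mapped, in-range addresses resolve. */
static void *toy_translate(const struct toy_iommu *m, toy_dma_addr_t a)
{
    if (a < TOY_IOVA_BASE)
        return NULL;
    uint64_t slot = (a - TOY_IOVA_BASE) / TOY_SLOT_SIZE;
    uint64_t within = (a - TOY_IOVA_BASE) % TOY_SLOT_SIZE;
    if (slot >= TOY_SLOTS || m->cpu_ptr[slot] == NULL || within >= m->len[slot])
        return NULL;
    return (uint8_t *)m->cpu_ptr[slot] + within;
}

/* Unmap: after this the device can no longer reach the buffer. */
static void toy_unmap(struct toy_iommu *m, toy_dma_addr_t a)
{
    if (a < TOY_IOVA_BASE)
        return;
    uint64_t slot = (a - TOY_IOVA_BASE) / TOY_SLOT_SIZE;
    if (slot < TOY_SLOTS)
        m->cpu_ptr[slot] = NULL;
}
```

The value returned by `toy_map` has no numeric relationship to the CPU pointer, and it stops resolving after `toy_unmap`. That mirrors why a driver programs the device with the mapped address rather than the pointer it uses itself.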
Coherent DMA and Streaming DMA Are Different
Many systems distinguish two DMA memory models.
Coherent DMA memory means the platform guarantees coherency between CPU and device for that memory. The driver usually does not need manual clean/invalidate on every access.
It fits descriptor rings and control blocks that CPU and device both touch frequently.
Streaming DMA mappings are more common for one-shot or staged transfers. The CPU owns the buffer in one phase, the device owns it in another, and ownership changes require map/unmap or sync operations to handle cache visibility.
CPU prepares buffer
-> sync/map for device
-> device DMA
-> sync/unmap for CPU
-> CPU handles result
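The ownership hand-off in this sequence can be modeled as a tiny state machine. This sketch only tracks who owns the buffer; the points where a real driver would clean, invalidate, map, or unmap are marked in comments. `stream_buf`, `sync_for_device`, `sync_for_cpu`, and `cpu_may_touch` are hypothetical names, not a real DMA API.

```c
#include <assert.h>
#include <stddef.h>

enum buf_owner { OWNER_CPU, OWNER_DEVICE };

struct stream_buf {
    enum buf_owner owner;
    unsigned handoffs_to_device;  /* counts CPU -> device transitions */
    unsigned handoffs_to_cpu;     /* counts device -> CPU transitions */
};

/* CPU -> device: the point where a clean (or map) would happen. */
static int sync_for_device(struct stream_buf *b)
{
    assert(b != NULL);
    if (b->owner != OWNER_CPU)
        return -1;                /* handing off twice is a driver bug */
    b->handoffs_to_device++;
    b->owner = OWNER_DEVICE;
    return 0;
}

/* Device -> CPU: the point where an invalidate (or unmap) would happen.
 * Call only after DMA completion has been confirmed. */
static int sync_for_cpu(struct stream_buf *b)
{
    assert(b != NULL);
    if (b->owner != OWNER_DEVICE)
        return -1;
    b->handoffs_to_cpu++;
    b->owner = OWNER_CPU;
    return 0;
}

/* The CPU may read or write the buffer only while it owns it. */
static int cpu_may_touch(const struct stream_buf *b)
{
    assert(b != NULL);
    return b->owner == OWNER_CPU;
}
```

Guarding every CPU access with a check like `cpu_may_touch` in debug builds is one way to catch a buffer being modified or reused while the device still owns it.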
The tradeoff differs:
- coherent memory is convenient, but may be scarce or have different performance
- streaming DMA is flexible, but synchronization timing must be precise
Drivers fail when they mix the two semantics, most often by assuming memory is automatically coherent when it actually requires manual sync.
Memory Barriers Do Not Replace Cache Writeback
Memory barriers are often confused with cache maintenance.
A memory barrier constrains the ordering observed by CPU, compiler, or bus, ensuring certain reads and writes are not reordered past a point.
For example, when writing DMA descriptors, the driver may need:
write descriptor contents first
then write doorbell register
If the order is wrong, the device may receive the notification before the descriptor is ready.
But a memory barrier is not cache clean. It does not automatically write dirty cache data back to memory, and it does not automatically invalidate stale cache lines.
A useful separation is:
cache clean/invalidate: make CPU cache and memory contents coherent
memory barrier: constrain memory access ordering and visibility order
DMA paths often need both, but they solve different problems.
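The descriptor-then-doorbell ordering can be illustrated with C11 atomics. This is a sketch under explicit assumptions: the "doorbell" is an ordinary atomic variable rather than an MMIO register, the consumer stands in for the device, and `toy_desc`, `toy_ring`, `post_descriptor`, and `poll_descriptor` are hypothetical names. Real device doorbells need the platform's own I/O barriers, and nothing in this code cleans or invalidates a cache.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct toy_desc {
    uint64_t addr;  /* buffer address the consumer should use */
    uint32_t len;   /* buffer length in bytes */
};

struct toy_ring {
    struct toy_desc desc;       /* descriptor the consumer will fetch */
    _Atomic uint32_t doorbell;  /* stand-in for a doorbell register */
};

/* Producer: write descriptor contents first, then ring the doorbell. */
static void post_descriptor(struct toy_ring *r, uint64_t addr, uint32_t len)
{
    assert(len > 0);
    r->desc.addr = addr;
    r->desc.len = len;
    /* Release fence: the descriptor stores above are ordered before
     * the doorbell store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&r->doorbell, 1, memory_order_relaxed);
}

/* Consumer: read the descriptor only after seeing the doorbell. */
static int poll_descriptor(struct toy_ring *r, struct toy_desc *out)
{
    if (atomic_load_explicit(&r->doorbell, memory_order_relaxed) == 0)
        return 0;
    /* Acquire fence: the descriptor loads below cannot be reordered
     * ahead of the doorbell load above. */
    atomic_thread_fence(memory_order_acquire);
    *out = r->desc;
    return 1;
}
```

The fences control only the order in which the stores and loads become visible. If the descriptor lived in non-coherent cacheable memory, a cache clean would still be needed before the doorbell write.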
What to Check During DMA Corruption
When DMA data is stale, occasionally wrong, fixed by logs, or different across platforms, check these layers first.
First, is the address layer correct? The device needs a DMA address, not a user pointer or ordinary kernel virtual address.
Second, is direction handled correctly? CPU-to-device needs clean; device-to-CPU needs invalidate.
Third, is sync timing correct? Sync for device before DMA starts; sync for CPU after DMA completes.
Fourth, is the buffer aligned? Address and length must satisfy cache-line, DMA controller, and platform requirements.
Fifth, is lifetime clear? Is the CPU modifying the buffer during DMA? Was the buffer freed or reused too early?
Sixth, is descriptor ordering correct? Descriptor writes, data writes, doorbell writes, and interrupt handling may need memory barriers.
Seventh, is the platform hardware-coherent? Some platforms are DMA coherent, others are not. Drivers should not rely on luck.
These questions are more useful than assuming “DMA is broken.”
What to Remember in Practice
DMA lets devices access memory without CPU copying. Cache lets the CPU avoid going out to DRAM on every load or store.
Together they create visibility problems:
- CPU writes new data, but the device may still see old memory
- device writes new data, but the CPU may still read old cache
- a CPU pointer is not necessarily a DMA address the device can use
- cache maintenance works at cache-line granularity, so alignment matters
- memory barriers constrain ordering; they do not clean or invalidate cache lines
Handling DMA in a driver is not just writing address and length into registers. The important part is defining ownership, visibility, and ordering between CPU, cache, memory, and device.