Why Cache, Memory Barriers, and DMA Often Break Drivers

Some driver bugs feel almost random.

The CPU has written a descriptor, but the device reads old contents. The DMA completion interrupt has fired, but the driver still reads stale buffer data. Adding one log line makes the bug disappear. Changing the optimization level brings it back. The code worked on a single-core MCU, but it fails intermittently on an SoC with a cache.

These bugs are often not caused by “broken DMA” or “an aggressive compiler.” They happen because several distinct guarantees were treated as if they were interchangeable.

The safest first model is this: cache maintenance handles data-version visibility between CPU cache and memory; memory barriers handle access ordering; atomics and locks handle concurrent CPU execution; DMA mapping handles device-visible addresses and buffer ownership.

cache clean / invalidate: data version
memory barrier: access ordering
atomic / lock: CPU concurrency
DMA mapping: device address and buffer ownership

These mechanisms often appear together, but they cannot replace one another.

Cache Lets the CPU and Device See Different Versions

CPU memory accesses usually pass through cache. A DMA device may bypass CPU cache and access memory directly, or access a coherent domain through a bus.

Two directions cause most problems:

Transmit:
CPU writes tx buffer
-> new data stays in cache
-> memory still has old data
-> DMA device reads old data

Receive:
device DMA writes rx buffer
-> memory has new data
-> CPU cache still has old data
-> CPU reads stale value

So “same address” does not mean “same contents are visible.” CPU, cache, memory, and DMA devices can observe different versions.

Cache clean writes dirty CPU cache lines back to memory so the device can read them. Cache invalidate discards stale CPU cache lines so the CPU reloads data written by the device.

This solves data-version visibility. It does not solve ordering or mutual exclusion.
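
Both stale-data directions can be reproduced with a toy model. The sketch below is not real cache-maintenance code: it assumes a single cache line, a write-back policy, and a device that always bypasses the cache. All names (toy_system, cpu_write, dev_read, cache_clean, cache_invalidate) are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 8u  /* toy value; real cache lines are commonly 32 to 128 bytes */

/* One line of memory plus at most one cached copy of it. */
struct toy_system {
    uint8_t memory[LINE_SIZE];  /* what a non-coherent DMA device sees */
    uint8_t cache[LINE_SIZE];   /* what the CPU sees while the line is valid */
    bool    valid;
    bool    dirty;
};

/* Load the line from memory if the CPU does not hold it yet. */
static void line_fill(struct toy_system *s)
{
    if (!s->valid) {
        memcpy(s->cache, s->memory, LINE_SIZE);
        s->valid = true;
        s->dirty = false;
    }
}

/* CPU write: only the cached copy changes (write-back behaviour). */
static void cpu_write(struct toy_system *s, size_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    line_fill(s);
    s->cache[off] = v;
    s->dirty = true;
}

/* CPU read: served from the cached copy whenever the line is valid. */
static uint8_t cpu_read(struct toy_system *s, size_t off)
{
    assert(off < LINE_SIZE);
    line_fill(s);
    return s->cache[off];
}

/* Device accesses go straight to memory and never touch the cache. */
static void dev_write(struct toy_system *s, size_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    s->memory[off] = v;
}

static uint8_t dev_read(const struct toy_system *s, size_t off)
{
    assert(off < LINE_SIZE);
    return s->memory[off];
}

/* Clean: write a dirty line back so the device can read it. */
static void cache_clean(struct toy_system *s)
{
    if (s->valid && s->dirty) {
        memcpy(s->memory, s->cache, LINE_SIZE);
        s->dirty = false;
    }
}

/* Invalidate: drop the cached copy so the next CPU read reloads from memory. */
static void cache_invalidate(struct toy_system *s)
{
    s->valid = false;
    s->dirty = false;
}
```

In this model, a cpu_write followed by dev_read returns the old byte until cache_clean runs, and a dev_write followed by cpu_read returns the old byte until cache_invalidate runs. Note that cache_invalidate also throws away any dirty data in the line, which matters as soon as unrelated data shares that line.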

Transmit Usually Needs Clean Before Device Ownership

In the transmit direction, the CPU prepares data and the device reads it through DMA.

If CPU writes remain in cache, the device may read old memory contents. Before handing a buffer to the device, the driver must make the contents visible to the device.

CPU fills tx buffer / descriptor
-> clean cache or DMA sync for device
-> write device doorbell / start
-> device performs DMA read

Missing clean or sync-for-device can produce:

network card sends old packets
audio plays old samples
device reads old descriptors
commands occasionally become invalid
adding logs hides the issue because cache lines are disturbed

Descriptors and data buffers may be separate objects. The device may read a descriptor first, then follow the address inside it. Both must satisfy visibility and ordering requirements.
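
The transmit sequence, including the separate descriptor, might be sketched as follows. The platform hooks (plat_cache_clean, plat_write_barrier, plat_ring_doorbell) are stand-ins that only record the call order, and the descriptor layout and the buf_dma_addr parameter are hypothetical. A real driver would call its platform's cache or DMA sync routine, barrier primitive, and MMIO write instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout; a real device defines its own. */
struct tx_desc {
    uint64_t buf_addr;  /* device-visible address of the payload */
    uint32_t len;
    uint32_t flags;
};

enum step { STEP_CLEAN, STEP_BARRIER, STEP_DOORBELL };
struct trace_entry { enum step kind; const void *addr; };

#define TRACE_MAX 8u
static struct trace_entry trace[TRACE_MAX];
static size_t trace_len;

static void record(enum step kind, const void *addr)
{
    assert(trace_len < TRACE_MAX);
    trace[trace_len].kind = kind;
    trace[trace_len].addr = addr;
    trace_len++;
}

/* Stand-ins for platform hooks: they log the call and do nothing else. */
static void plat_cache_clean(const void *p, size_t n) { (void)n; record(STEP_CLEAN, p); }
static void plat_write_barrier(void) { record(STEP_BARRIER, NULL); }
static void plat_ring_doorbell(void) { record(STEP_DOORBELL, NULL); }

static uint8_t tx_buf[64];
static struct tx_desc tx_ring[1];

/* buf_dma_addr is assumed to come from the platform's DMA mapping API. */
static int tx_submit(const void *payload, size_t len, uint64_t buf_dma_addr)
{
    if (len > sizeof tx_buf)
        return -1;

    /* 1. Fill the data buffer and make it visible to the device. */
    memcpy(tx_buf, payload, len);
    plat_cache_clean(tx_buf, len);

    /* 2. Fill the descriptor that points at the buffer, then clean it too. */
    tx_ring[0].buf_addr = buf_dma_addr;
    tx_ring[0].len = (uint32_t)len;
    tx_ring[0].flags = 1u;
    plat_cache_clean(&tx_ring[0], sizeof tx_ring[0]);

    /* 3. Order everything above before the notification, then notify. */
    plat_write_barrier();
    plat_ring_doorbell();
    return 0;
}
```

After one successful tx_submit call, the recorded trace reads: clean buffer, clean descriptor, barrier, doorbell. Dropping either clean or moving the doorbell earlier is the kind of change that produces the symptoms listed above.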

Receive Usually Needs Invalidate After Completion

In the receive direction, the device writes memory through DMA and the CPU later reads the result.

If the CPU has an old cache line, it may keep reading old data after the device has updated memory. After DMA completion, the CPU must discard stale cache state before reading.

device DMA writes rx buffer
-> completion interrupt or status bit
-> invalidate cache or DMA sync for CPU
-> CPU reads result

Invalidating too early and invalidating too late can both break things. Too early, and a later CPU access or prefetch may pull stale data back into the cache before the device has finished writing. Too late, and the CPU may already have read stale data. Reading before DMA completion may also observe a partially written buffer.

Receive buffers also require ownership discipline. While the device owns the buffer, the CPU should not modify it arbitrarily. After ownership returns to the CPU, reads and writes have defined meaning again.
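
One way to make the ownership rule hard to violate is to track the owner explicitly and refuse CPU reads while the device holds the buffer. The following is a minimal sketch under that assumption. The rx_buf type, its functions, and the plat_cache_invalidate hook are all hypothetical, and fake_device_write only simulates the DMA transfer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum owner { OWNER_CPU, OWNER_DEVICE };

struct rx_buf {
    uint8_t    data[64];
    size_t     used;           /* bytes the device reported as written */
    enum owner owner;
    unsigned   invalidations;  /* counts calls to the stand-in hook */
};

/* Stand-in for a platform invalidate or sync-for-CPU call. */
static void plat_cache_invalidate(struct rx_buf *b) { b->invalidations++; }

/* Hand the buffer to the device. The CPU must not touch data[] afterwards. */
static bool rx_post(struct rx_buf *b)
{
    if (b->owner != OWNER_CPU)
        return false;
    b->used = 0;
    b->owner = OWNER_DEVICE;
    return true;
}

/* Completion path: run after the interrupt or status bit reports completion. */
static bool rx_complete(struct rx_buf *b, size_t len)
{
    if (b->owner != OWNER_DEVICE || len > sizeof b->data)
        return false;
    plat_cache_invalidate(b);  /* discard stale lines before any CPU read */
    b->used = len;
    b->owner = OWNER_CPU;
    return true;
}

/* CPU read: returns bytes copied, or -1 while the device owns the buffer. */
static long rx_read(const struct rx_buf *b, void *out, size_t cap)
{
    if (b->owner != OWNER_CPU)
        return -1;
    size_t n = b->used < cap ? b->used : cap;
    memcpy(out, b->data, n);
    return (long)n;
}

/* Simulates the device's DMA write; not part of a real driver. */
static void fake_device_write(struct rx_buf *b, const void *src, size_t n)
{
    assert(b->owner == OWNER_DEVICE && n <= sizeof b->data);
    memcpy(b->data, src, n);
}
```

A read attempted between rx_post and rx_complete returns -1 instead of silently returning stale or partial data. Some platforms may also require cache maintenance before the buffer is posted; that step is omitted here.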

Cache-Line Granularity Can Damage Neighbor Data

Cache maintenance usually works at cache-line granularity, not exact byte ranges.

If a DMA buffer is not cache-line aligned, or shares a cache line with normal variables, cache maintenance can affect neighboring data.

Example:

[normal variable A][part of DMA rx buffer]  // same cache line

Invalidating this cache line to read DMA results may also discard a CPU write to variable A that has not been written back. Cleaning the same line for DMA transmit may write back unrelated data in the same line.

DMA buffers usually need:

address aligned to platform requirements
length handled according to cache-line or DMA API rules
no sharing of cache lines with unrelated variables
clear ownership for the buffer lifetime

This is not a matter of performance tidiness. It is a correctness requirement.
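
In C11, alignment and padding can be enforced at compile time. The sketch below assumes a 64-byte cache line, which is only an example value to be replaced with the platform's real line size. The struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u   /* assumption; use the real line size of the target */
#define RX_PAYLOAD 100u  /* bytes the device may write */

/* Payload size rounded up to a whole number of cache lines (128 here). */
#define RX_PADDED (((RX_PAYLOAD) + CACHE_LINE - 1u) / CACHE_LINE * CACHE_LINE)

/* Round a length up to the next multiple of the cache-line size. */
static inline size_t round_up_to_line(size_t n)
{
    return (n + (CACHE_LINE - 1u)) & ~((size_t)CACHE_LINE - 1u);
}

static inline bool is_line_aligned(const void *p)
{
    return ((uintptr_t)p & (CACHE_LINE - 1u)) == 0;
}

struct rx_slot {
    uint32_t cpu_only_before;                     /* ordinary CPU variable */
    alignas(CACHE_LINE) uint8_t dma[RX_PADDED];   /* starts on its own line */
    alignas(CACHE_LINE) uint32_t cpu_only_after;  /* pushed to the next line */
};

/* Fail the build if the DMA region could share a line with its neighbours. */
static_assert(offsetof(struct rx_slot, dma) % CACHE_LINE == 0,
              "DMA buffer must start on a cache-line boundary");
static_assert(RX_PADDED % CACHE_LINE == 0,
              "DMA buffer must cover whole cache lines");
static_assert(offsetof(struct rx_slot, cpu_only_after) >=
              offsetof(struct rx_slot, dma) + RX_PADDED,
              "data after the buffer must not share its last line");

static struct rx_slot slot;
```

With this layout, invalidating the lines that cover slot.dma cannot discard a pending write to cpu_only_before or cpu_only_after, because neither field lives in those lines. Heap or pool allocations need the same guarantee from the allocator.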

Memory Barriers Solve Ordering, Not Cache Writeback

Memory barriers constrain the order in which memory accesses are performed and become visible to other observers.

A common driver case is: prepare a descriptor, then notify the device.

write descriptor fields
-> memory barrier
-> write doorbell register

The barrier prevents the CPU, compiler, or bus from making the doorbell write visible to the device before the descriptor writes. When the device sees the doorbell, the descriptor should already be visible according to the platform rules.

But a barrier is not cache clean. It does not automatically write dirty cache lines to memory, and it does not invalidate stale cache lines.

Simple distinction:

cache clean/invalidate: data version
memory barrier: access ordering

Many DMA paths need both. A barrier without clean may still let the device read old data. Clean without ordering may let the device see a start notification before the descriptor is ready.
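
The descriptor-then-doorbell pattern looks roughly like this in C. This is a sketch under a loud assumption: plat_wmb is implemented with a C11 release fence purely as a placeholder. C11 fences are only specified for ordering between CPU threads, so a real driver should use the barrier its platform defines for device-visible writes (Linux, for example, provides dma_wmb() and wmb()). The descriptor layout and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor; a real device defines the layout. */
struct desc {
    uint32_t addr_lo;
    uint32_t len;
    uint32_t flags;
};

/* Placeholder barrier. Assumption: replace with the platform's write barrier. */
static inline void plat_wmb(void)
{
    atomic_thread_fence(memory_order_release);
}

/* MMIO-style store: volatile keeps the compiler from dropping or merging it. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

static void post_descriptor(struct desc *d, volatile uint32_t *doorbell,
                            uint32_t addr_lo, uint32_t len, uint32_t tail)
{
    assert(d != NULL && doorbell != NULL);

    /* 1. Prepare every descriptor field. */
    d->addr_lo = addr_lo;
    d->len = len;
    d->flags = 1u;

    /* On a non-coherent platform, a cache clean of *d would go here.
     * The barrier below does not perform it. */

    /* 2. Order the descriptor writes before the notification. */
    plat_wmb();

    /* 3. Notify the device. */
    mmio_write32(doorbell, tail);
}
```

Running this against an ordinary variable in place of the doorbell register shows the data flow but cannot demonstrate the ordering itself; reordering bugs of this kind typically appear only on real hardware and only intermittently.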

volatile Is Not a DMA Synchronization Tool

volatile is often incorrectly used to fix DMA or multicore visibility problems.

Its main job is to tell the compiler not to omit or merge accesses to the object, and not to satisfy them from a copy held in a register. It is useful for MMIO register access.

But volatile does not guarantee:

cache writeback
cache invalidation
cross-CPU visibility ordering
DMA buffer ownership transfer
read-modify-write atomicity
mutual exclusion

Declaring a DMA buffer volatile does not make the device see new data sitting in CPU cache, and it does not make the CPU discard old cache lines. Declaring a shared flag volatile also does not replace release/acquire, locks, or memory barriers.

If volatile appears to fix the bug, it likely changed code generation or timing and hid the real synchronization problem.
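
The missing read-modify-write atomicity can be shown deterministically by writing out the interleaving by hand. In the sketch below, the "task" and "ISR" are just labelled statements in one function, standing in for a preemption between the load and the store of a volatile increment. The function names are invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* counter++ on a volatile is a load, an add, and a store. An interrupt can
 * land between them. The interleaving is spelled out explicitly here. */
static uint32_t interleaved_volatile_increment(void)
{
    volatile uint32_t counter = 0;

    uint32_t task_tmp = counter;      /* task: load 0 */
    uint32_t isr_tmp  = counter;      /* ISR preempts: load 0 */
    assert(task_tmp == isr_tmp);      /* both contexts saw the same old value */
    counter = isr_tmp + 1u;           /* ISR: store 1 */
    counter = task_tmp + 1u;          /* task resumes: store 1, ISR update lost */

    return counter;
}

/* An atomic fetch-add is one indivisible step, so there is no gap in which
 * another context could observe or overwrite an intermediate state. */
static uint32_t atomic_increment_twice(void)
{
    atomic_uint counter;
    atomic_init(&counter, 0u);

    atomic_fetch_add_explicit(&counter, 1u, memory_order_relaxed);
    atomic_fetch_add_explicit(&counter, 1u, memory_order_relaxed);

    return (uint32_t)atomic_load_explicit(&counter, memory_order_relaxed);
}
```

Two increments on the volatile counter leave it at 1 under this interleaving, while the atomic version reaches 2. Neither version does anything about cache maintenance or device visibility.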

Atomics and Locks Solve CPU Concurrency

Atomic operations ensure a read-modify-write operation is not observed halfway, and may provide acquire/release ordering. Locks provide mutual exclusion and usually imply required ordering for a critical section.

They solve concurrency among CPU execution contexts:

multiple threads modifying one counter
ISR and task sharing state
multicore data and flag publication
protecting data structures in critical sections

But atomics and locks do not automatically handle DMA cache-version problems. A thread may safely write data into a buffer, but that does not mean the device can see those writes in memory. A lock protects CPU-side access; the device is not a normal thread that participates in the same lock.

Drivers often need two synchronization categories:

among CPU contexts: locks, atomics, queues, wait/wakeup
between CPU and device: DMA mapping, cache sync, MMIO barriers, doorbell ordering

Mixing these categories leads to code that is “locked” but still wrong.
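
For the CPU-to-CPU category, C11 release/acquire atomics are a common building block. The sketch below assumes exactly one producer and one consumer (for example, an ISR and a task). The mailbox type and its functions are hypothetical. It deliberately contains no cache maintenance, because it is not meant for data shared with a device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mailbox {
    uint32_t    payload;  /* plain data, published through the flag */
    atomic_bool ready;
};

static void mailbox_init(struct mailbox *m)
{
    assert(m != NULL);
    m->payload = 0;
    atomic_init(&m->ready, false);
}

/* Producer: write the data first, then publish it with a release store. */
static bool mailbox_publish(struct mailbox *m, uint32_t value)
{
    if (atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;  /* previous message has not been consumed yet */
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
    return true;
}

/* Consumer: the acquire load pairs with the producer's release store, so a
 * consumer that sees ready == true also sees the payload written before it. */
static bool mailbox_try_take(struct mailbox *m, uint32_t *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    /* Release the slot only after the payload has been read. */
    atomic_store_explicit(&m->ready, false, memory_order_release);
    return true;
}
```

If the payload were a DMA buffer, this flag protocol would still be correct between the two CPU contexts and still say nothing about what the device can see in memory.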

DMA Mapping Handles Addresses and Ownership

On systems with an MMU, IOMMU, or complex bus topology, a CPU pointer is not necessarily a device-usable address.

Drivers commonly see:

user virtual address
kernel virtual address
physical address
DMA address
I/O virtual address

DMA mapping APIs do more than compute an address. They often express:

which device receives the buffer
direction: to-device, from-device, or bidirectional
the DMA address the device can use
whether an IOMMU mapping is needed
whether cache synchronization is needed
when ownership transfers between CPU and device

So correct DMA code usually does not cast a pointer to an integer and write it to a register. It uses platform APIs to map, synchronize, start the device, then synchronize or unmap after completion.
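
The map, use, unmap life cycle can be imitated with a toy translation table. This is not a real DMA API: toy_dma_map, toy_dma_unmap, toy_device_translate, and the address constants are invented for illustration. The only point is that the device works in its own address space and that a mapping exists for a bounded time. On Linux, the corresponding real calls are dma_map_single(), dma_mapping_error(), and dma_unmap_single().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_dir { TOY_TO_DEVICE, TOY_FROM_DEVICE, TOY_BIDIRECTIONAL };

#define TOY_SLOTS     4u
#define TOY_SLOT_SIZE 0x1000u
#define TOY_IOVA_BASE UINT64_C(0x80000000)  /* arbitrary device-side base */
#define TOY_MAP_ERROR UINT64_MAX

struct toy_mapping {
    void        *cpu;   /* CPU pointer the driver uses */
    size_t       len;
    enum toy_dir dir;   /* who is allowed to write during the mapping */
    uint64_t     iova;  /* address the device must be given */
    bool         in_use;
};

static struct toy_mapping toy_table[TOY_SLOTS];

/* Map a CPU buffer and return the address to program into the device. */
static uint64_t toy_dma_map(void *cpu, size_t len, enum toy_dir dir)
{
    if (cpu == NULL || len == 0 || len > TOY_SLOT_SIZE)
        return TOY_MAP_ERROR;
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (!toy_table[i].in_use) {
            toy_table[i].cpu = cpu;
            toy_table[i].len = len;
            toy_table[i].dir = dir;
            toy_table[i].iova = TOY_IOVA_BASE + (uint64_t)i * TOY_SLOT_SIZE;
            toy_table[i].in_use = true;
            return toy_table[i].iova;
        }
    }
    return TOY_MAP_ERROR;  /* table full */
}

static struct toy_mapping *toy_find(uint64_t iova)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (toy_table[i].in_use && toy_table[i].iova == iova)
            return &toy_table[i];
    return NULL;
}

/* What the toy device does with an address it was given. */
static void *toy_device_translate(uint64_t iova)
{
    struct toy_mapping *m = toy_find(iova);
    return m ? m->cpu : NULL;
}

/* End of the device's access window; the address is invalid afterwards. */
static bool toy_dma_unmap(uint64_t iova)
{
    struct toy_mapping *m = toy_find(iova);
    if (m == NULL)
        return false;
    m->in_use = false;
    return true;
}
```

The value returned by toy_dma_map is what goes into the descriptor or register, never the CPU pointer itself. Once toy_dma_unmap has run, the same address no longer translates, which mirrors the rule that the device must not touch a buffer after ownership has returned to the CPU.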

Coherent Memory Still Has Rules

Some platforms support coherent DMA memory. For this memory, hardware or the platform maintains coherency between CPU and device, so the driver usually does not manually clean or invalidate every transfer.

It is often used for:

DMA descriptor rings
control blocks
small status structures frequently shared by CPU and device

But coherent does not mean no ordering problems, and it does not remove the device protocol.

Before a device reads a descriptor, the CPU must still ensure field writes and doorbell ordering. After the device writes status, the CPU must still follow the device completion protocol. Multicore CPU access to shared structures may still require locks or atomics.

Coherent memory solves one kind of cache data-version problem, not every synchronization problem.
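
Even with coherent memory, the CPU has to read a completion descriptor in the right order: status first, payload fields only afterwards. The sketch below models that with C11 acquire/release atomics as a stand-in for the platform's read barrier (on Linux, a dma_rmb() between reading the status and reading the other fields plays this role). The descriptor layout, the status bits, and fake_device_complete, which merely simulates the device, are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_OWN_DEVICE 0x1u  /* device may write the descriptor */
#define DESC_DONE       0x2u  /* device has finished; CPU may read */

struct ring_desc {
    uint32_t    len;     /* CPU: buffer capacity; device: bytes written */
    atomic_uint status;  /* written last by the device */
};

/* CPU side: set up the descriptor and hand it to the device. */
static void desc_init(struct ring_desc *d, uint32_t capacity)
{
    assert(d != NULL);
    d->len = capacity;
    atomic_init(&d->status, DESC_OWN_DEVICE);
}

/* Simulated device: write the result fields, then publish DONE last. */
static void fake_device_complete(struct ring_desc *d, uint32_t bytes)
{
    unsigned s = atomic_load_explicit(&d->status, memory_order_acquire);
    assert(s & DESC_OWN_DEVICE);
    assert(bytes <= d->len);
    d->len = bytes;
    atomic_store_explicit(&d->status, DESC_DONE, memory_order_release);
}

/* CPU side: check status first; read len only once DONE has been observed. */
static bool desc_poll(struct ring_desc *d, uint32_t *bytes)
{
    unsigned s = atomic_load_explicit(&d->status, memory_order_acquire);
    if (!(s & DESC_DONE))
        return false;   /* still owned by the device: len is not meaningful */
    *bytes = d->len;    /* ordered after the status read by the acquire */
    return true;
}
```

Reading len before checking status, or without any ordering between the two reads, can return the capacity instead of the byte count even though no cache maintenance is missing.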

Debug by Identifying the Missing Guarantee

DMA and cache bugs can be classified by the missing guarantee:

device reads old data: did CPU-to-device clean or sync-for-device happen?
CPU reads old data: did device-to-CPU invalidate or sync-for-CPU happen?
device starts too early: is a barrier missing between descriptor writes and doorbell?
neighboring variable is corrupted: is the DMA buffer cache-line aligned and isolated?
code is locked but still wrong: does it also need DMA cache sync?
pointer written to device is invalid: does the device need a DMA address, not a CPU virtual address?
multicore failure is intermittent: are atomic, barrier, and lock ordering correct?
failure after low power: were cache, IOMMU, DMA controller, and device state restored?

This classification is more reliable than “try adding volatile.”

What to Remember

Cache, memory barriers, atomics, locks, and DMA mapping all deal with who can see what, in what order, and whether multiple agents can modify it at the same time. But they operate at different layers.

Cache clean/invalidate handles data versions between CPU cache and memory. Memory barriers handle access ordering. Atomics and locks handle synchronization among CPU execution contexts. DMA mapping handles device-visible addresses, IOMMU state, and buffer ownership.

Driver data corruption often comes from mixing these guarantees. Good debugging does not start with logs, volatile, or random barriers. It starts by asking which guarantee is missing: data visibility, access ordering, mutual exclusion, or device address and ownership.