How Memory Barriers, Atomics, and volatile Differ

Some embedded and systems bugs are hard to reproduce: adding a log makes them disappear, disabling optimization hides them, changing CPU architecture exposes them, or multicore load makes them fail occasionally.

The first reaction is often to add volatile.

But volatile is not a thread synchronization primitive. It is not a memory barrier, and it is not a lock. It can prevent the compiler from optimizing away certain accesses, but it does not guarantee multicore visibility order, and it does not turn x++ into an atomic operation.

A useful first model is: volatile controls whether the compiler emits an access; atomics control whether an operation is indivisible; memory barriers control ordering and visibility; locks control mutual exclusion and critical sections.

volatile: do not remove or merge this access
atomic: other execution flows cannot observe half of this operation
barrier: memory accesses across the barrier cannot be freely reordered
lock: only one execution flow enters the critical section at a time

They overlap, but they do not replace one another.

Both Compilers and CPUs Can Change Order

Source order is not the only order other execution flows may observe.

The compiler may reorder independent accesses, keep variables in registers, merge repeated reads or writes, and delete accesses it believes have no visible effect.

The CPU may also execute out of order, use store buffers, prefetch loads, or delay writes so another CPU or device observes a different order than the source code suggests.

For example:

data = 42;
ready = 1;

The programmer may expect another thread that sees ready == 1 to also see data == 42. Without synchronization, neither the compiler nor the CPU necessarily provides that cross-thread guarantee.

A concurrent protocol cannot rely solely on the fact that data is written before ready in the source. The ordering has to be requested explicitly, with an atomic, a barrier, or a lock.

What volatile Actually Guarantees

The core purpose of volatile is to tell the compiler that the object’s value may change in ways the compiler cannot see, so every access written in the source must be emitted as a real load or store.

It is useful for cases such as:

memory-mapped registers
interaction with signal handlers or special runtime environments
preventing removal of externally visible accesses
polling hardware status registers in bare-metal code

Example:

volatile uint32_t *status = (volatile uint32_t *)STATUS_REG;

while ((*status & READY_BIT) == 0) {
    /* busy-wait: each iteration re-reads the register */
}

Without volatile, the compiler might assume *status does not change inside the loop and read it only once.

But volatile does not guarantee:

multithreaded mutual exclusion
read-modify-write atomicity
memory access ordering
CPU cache coherence
happens-before between threads
safe shared data structures between tasks and interrupts

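So using volatile int flag for thread synchronization is usually insufficient. In C11 and C++11, unsynchronized concurrent access to a non-atomic object is a data race even when the object is volatile, so the behavior is undefined.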

x++ Is Not Atomic

Many operations that look like one statement are multiple machine-level steps.

counter++;

It usually involves:

read counter
add 1
write counter back

If two threads execute this at the same time, both may read the old value and both write back the same new value, losing one increment.

Declaring counter as volatile still does not fix this. It forces the load and the store to be emitted, but it does not make the read-add-write sequence indivisible.

Atomic operations solve this class of problem. They make an operation appear indivisible to other execution flows, or provide explicit compare-and-swap, fetch-add, exchange, and similar semantics.

Atomicity answers whether an operation can be observed halfway or whether concurrent updates can be lost. It is not the same as mutual exclusion for a whole critical section.
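
The lost-update problem and its fix can be sketched with C11 atomics. This is a minimal sketch, assuming a hosted environment with stdatomic.h and POSIX threads. The names count_with_atomics, worker, and MAX_THREADS are made up for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 16          /* illustrative limit, not from the article */

static atomic_long counter;     /* shared counter, updated by every worker */

struct worker_arg { long iterations; };

static void *worker(void *p)
{
    const struct worker_arg *arg = p;
    for (long i = 0; i < arg->iterations; i++) {
        /* Indivisible read-add-write; relaxed order is enough for a plain count. */
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final count, or -1 on error. */
long count_with_atomics(int nthreads, long iterations)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg arg = { iterations };
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    atomic_store(&counter, 0);
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &arg) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);   /* join makes all increments visible here */

    return started == nthreads ? atomic_load(&counter) : -1;
}
```

With four threads and 100000 iterations each, the result is always 400000. Replacing atomic_long with volatile long and counter++ would typically produce a smaller, varying number.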

Atomics Also Have Memory Order

In many languages and kernel APIs, atomic operations carry memory-order semantics.

Common semantics include:

relaxed: only atomicity for this variable, no extra ordering
acquire: later reads and writes cannot move before this operation
release: earlier reads and writes cannot move after this operation
acquire-release: both acquire and release semantics
sequentially consistent: acquire-release semantics plus a single total order that all threads agree on for sequentially consistent operations

A typical data publication model is:

producer writes data
producer release-writes ready
consumer acquire-reads ready
consumer reads data

The key is not only that ready is atomic. Release/acquire establishes ordering between threads: after the consumer sees ready, it also sees the data published before ready.

A relaxed atomic counter may be fine for statistics, but not for publishing a complex object.
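
The publication model above might look like this in C11. It is a sketch, assuming POSIX threads. data and ready mirror the earlier example, while producer, consumer, and run_publication_demo are illustrative names.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int data;            /* ordinary payload, deliberately not atomic */
static atomic_int ready;    /* publication flag */

static void *producer(void *unused)
{
    (void)unused;
    data = 42;                                               /* write payload */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish */
    return NULL;
}

static void *consumer(void *out)
{
    /* Spin until the flag is observed; acquire pairs with the release store. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
    }
    *(int *)out = data;     /* ordered after the acquire load of ready */
    return NULL;
}

/* Returns the value the consumer observed, or -1 if a thread failed to start. */
int run_publication_demo(void)
{
    pthread_t prod, cons;
    int seen = -1;

    data = 0;
    atomic_store(&ready, 0);

    if (pthread_create(&cons, NULL, consumer, &seen) != 0)
        return -1;
    if (pthread_create(&prod, NULL, producer, NULL) != 0) {
        atomic_store(&ready, 1);      /* unblock the consumer so it can be joined */
        pthread_join(cons, NULL);
        return -1;
    }
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return seen;
}
```

If both operations on ready were memory_order_relaxed, the flag itself would still be atomic, but nothing would order the read of data after the write of data.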

Memory Barriers Control Order

A memory barrier, or fence, constrains ordering and visibility of accesses around the barrier.

Roughly:

read barriers constrain read ordering
write barriers constrain write ordering
full barriers constrain read and write ordering
compiler barriers constrain compiler reordering
CPU barriers constrain hardware observation order

For example, when a producer writes data and then notifies a device or another CPU:

write descriptor contents
write memory barrier
write doorbell register

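The barrier ensures that by the time the other side observes the doorbell write, the descriptor contents are already visible to it. When the other side is another CPU, it needs a matching read barrier or acquire operation between reading the doorbell and reading the descriptor.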

Without the barrier, the other side may see the notification before seeing the updated descriptor.
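
The three steps can be sketched between two CPUs with C11 fences. This is only an approximation of the pattern: atomic_thread_fence orders accesses as seen by other threads, and a real device doorbell needs the platform's own barrier and register accessors. struct descriptor, post_descriptor, and take_descriptor are hypothetical names, and the handoff is one-shot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor layout, for illustration only. */
struct descriptor {
    uint32_t addr;
    uint32_t len;
};

static struct descriptor desc;   /* ordinary memory */
static atomic_uint doorbell;     /* stand-in for the notification */

/* Producer side: descriptor first, barrier, then the notification. */
void post_descriptor(uint32_t addr, uint32_t len)
{
    desc.addr = addr;                                           /* 1. descriptor contents */
    desc.len = len;
    atomic_thread_fence(memory_order_release);                  /* 2. write barrier */
    atomic_store_explicit(&doorbell, 1, memory_order_relaxed);  /* 3. doorbell */
}

/* Consumer side: returns false until the doorbell has been observed. */
bool take_descriptor(struct descriptor *out)
{
    if (atomic_load_explicit(&doorbell, memory_order_relaxed) == 0)
        return false;
    atomic_thread_fence(memory_order_acquire);   /* pairs with the release fence */
    *out = desc;                                 /* contents are visible here */
    return true;
}
```

The fences carry the ordering here, which is why the doorbell accesses themselves can be relaxed.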

A memory barrier is not a lock. It does not prevent two threads from entering the same code, and it does not protect a complex data structure from concurrent modification. It only constrains ordering.

Locks Usually Include Ordering Guarantees

In ordinary application code, the right tool is often not a hand-written barrier, but a lock, mutex, semaphore, condition variable, or queue.

Locks provide mutual exclusion and usually memory-ordering guarantees.

thread A:
lock
modify shared data
unlock

thread B:
lock
read shared data
unlock

Once thread B acquires the same lock, it sees every modification thread A made before its unlock.

That is why locks are the more common and easier-to-use synchronization tool. They may be implemented with atomics and barriers underneath, but callers do not need to write them everywhere.
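
A minimal sketch of the pattern with a POSIX mutex, assuming pthreads is available. struct pair and its two functions are invented for illustration. The point is that two fields forming one invariant change inside a single critical section.

```c
#include <pthread.h>

/* Invariant: a + b always equals the initial total (1000 here). */
struct pair {
    pthread_mutex_t lock;
    long a;
    long b;
};

static struct pair shared = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .a = 1000,
    .b = 0,
};

/* Moves amount from a to b; no other thread can observe the halfway state. */
void move_a_to_b(long amount)
{
    pthread_mutex_lock(&shared.lock);
    shared.a -= amount;
    shared.b += amount;
    pthread_mutex_unlock(&shared.lock);
}

/* Returns a + b as seen from inside the critical section. */
long read_total(void)
{
    pthread_mutex_lock(&shared.lock);
    long total = shared.a + shared.b;
    pthread_mutex_unlock(&shared.lock);
    return total;
}
```

Making a and b individually atomic would not help: a reader could still see a already decreased while b is not yet increased.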

Direct atomics and barriers are more common when locks are too heavy, lock-free data structures are needed, hardware ordering matters, low-level cross-CPU communication is implemented, or interrupt context restricts blocking.

Interrupts and Tasks Still Need Synchronization

A single-core MCU does not have multicore cache-coherence issues, but it still has concurrency.

An interrupt can preempt a task. If a task and ISR share variables without synchronization, bugs still happen:

task reads or writes a multi-byte value and is interrupted halfway
ISR modifies queue head/tail while task also modifies them
compiler keeps a flag in a register inside a task loop
task clears a flag while ISR sets it, losing an event

volatile can make the task reload a hardware or ISR-updated flag, but it does not protect complex data structures.
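
The lost-event case from the list above (task clears a flag while the ISR sets it) can be avoided with atomic bit operations. This sketch assumes the target provides lock-free atomic_uint, which not every small MCU does. The event bits and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define EVT_RX    (1u << 0)   /* hypothetical event bits */
#define EVT_TIMER (1u << 1)

/* An ISR must never block, so the atomic has to be lock-free on the target. */
_Static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic_uint must be lock-free for ISR use");

static atomic_uint pending_events;

/* ISR context: set bits without disturbing bits that are already pending. */
void isr_post_event(unsigned int bits)
{
    atomic_fetch_or_explicit(&pending_events, bits, memory_order_release);
}

/* Task context: fetch and clear every pending bit in one indivisible step. */
unsigned int task_take_events(void)
{
    return atomic_exchange_explicit(&pending_events, 0u, memory_order_acquire);
}
```

Because reading and clearing happen in one exchange, an event posted between the two can no longer be overwritten by the clear.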

Common approaches include:

briefly disabling interrupts around critical sections
using atomic bit operations
using RTOS ISR-safe queues or semaphores
using a single-producer single-consumer ring buffer with explicit memory ordering
using platform-provided barriers around shared state

So “single core means no synchronization” is also wrong. Single core removes some problems, not concurrency itself.
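
One of the approaches listed above, a single-producer single-consumer ring buffer with explicit memory ordering, might be sketched as follows. It assumes exactly one producer (for example an ISR) and one consumer (for example a task). RING_SIZE, struct ring, ring_push, and ring_pop are illustrative names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u   /* must be a power of two so index wraparound stays consistent */

struct ring {
    uint8_t buf[RING_SIZE];
    atomic_uint head;   /* written only by the producer */
    atomic_uint tail;   /* written only by the consumer */
};

/* Producer side. Returns false when the ring is full. */
bool ring_push(struct ring *r, uint8_t byte)
{
    unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;
    r->buf[head % RING_SIZE] = byte;
    /* Release: the slot write above becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(struct ring *r, uint8_t *out)
{
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;
    *out = r->buf[tail % RING_SIZE];
    /* Release: the slot read above completes before the slot can be reused. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Each index has a single writer, so no read-modify-write or interrupt masking is needed. The design stops being safe as soon as a second producer or consumer is added.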

Device Registers Need Dedicated Semantics

Driver code that accesses MMIO registers sometimes uses volatile in register definitions, but modern kernels usually prefer platform access APIs such as Linux readl() and writel().

Device register access is not only “do not optimize this away.”

It also involves:

access width
endianness
compiler reordering
CPU-to-device ordering
posted writes
read/write combining
bus semantics
ordering relative to normal memory

For example, writing a DMA descriptor and then writing a device doorbell register requires the descriptor to be visible to the device before the notification.

So drivers should not treat device registers as ordinary volatile int * pointers. Platform APIs often include the required access semantics and barrier constraints.
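
Where no platform API exists, for example in bare-metal code, a common pattern is to wrap register access in small helpers instead of scattering volatile pointers. The sketch below uses hypothetical helpers (mmio_read32, mmio_write32, mmio_wait_bits32). They fix the access width only, and they do not handle endianness, posted writes, or CPU-to-device ordering, which remain platform-specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers; a real driver would prefer its platform's accessors. */
static inline uint32_t mmio_read32(const volatile void *addr)
{
    /* volatile access of exactly uint32_t type; cannot be elided or cached */
    return *(const volatile uint32_t *)addr;
}

static inline void mmio_write32(volatile void *addr, uint32_t value)
{
    /* volatile store of exactly uint32_t type; cannot be dropped or merged */
    *(volatile uint32_t *)addr = value;
}

/*
 * Polls until any bit in mask is set, giving up after max_polls reads.
 * Returns true if a masked bit was seen, false on timeout.
 */
bool mmio_wait_bits32(const volatile void *addr, uint32_t mask, unsigned int max_polls)
{
    while (max_polls-- > 0) {
        if (mmio_read32(addr) & mask)
            return true;
    }
    return false;
}
```

Keeping the access in one place means a barrier or byte swap can later be added inside the helper without touching every call site. The bounded poll also avoids hanging forever on a device that never becomes ready.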

How to Debug These Bugs

When a bug disappears after adding logs, disappears with optimization off, fails only on multicore, loses events occasionally, or shows up as a driver reading stale state, work through the layers one at a time.

First, decide whether this is a compiler optimization issue or a CPU/multicore visibility issue. volatile mainly addresses the former.

Second, decide whether atomicity is required. Can shared counters, state bits, or reference counts be modified by multiple execution flows?

Third, decide whether ordering is required. After observing a flag, must the reader observe data published before the flag?

Fourth, decide whether mutual exclusion is required. Do multiple fields form an invariant that must be protected together?

Fifth, check interrupt context. Can a task and ISR interleave on the same state?

Sixth, check whether device registers or DMA are involved. Normal memory, MMIO, DMA buffers, and cache maintenance are not the same layer.

These questions help decide whether to use volatile, atomic operations, barriers, locks, or driver/platform APIs.

What Matters in Practice

volatile, atomics, memory barriers, and locks are not different spellings of the same synchronization idea.

volatile mostly constrains compiler optimization of individual accesses. Atomic operations make specific operations indivisible and may carry memory order. Memory barriers constrain ordering and visibility. Locks provide mutual exclusion and usually include the necessary ordering guarantees.

For threaded, ISR, driver, and multicore code, first ask:

who can access this data concurrently
whether atomicity is needed
whether ordering or visibility is needed
whether multiple fields need one invariant protected
whether the target is normal memory, a device register, or a DMA buffer

Answering those questions is far more reliable than adding volatile to every shared variable.