Some embedded and systems bugs are hard to reproduce: adding a log makes them disappear, disabling optimization hides them, changing CPU architecture exposes them, or multicore load makes them fail occasionally.
The first reaction is often to add volatile.
But volatile is not a thread synchronization primitive. It is not a memory barrier, and it is not a lock. It can prevent the compiler from optimizing away certain accesses, but it does not guarantee multicore visibility order, and it does not turn x++ into an atomic operation.
A useful first model is: volatile controls whether the compiler emits an access; atomics control whether an operation is indivisible; memory barriers control ordering and visibility; locks control mutual exclusion and critical sections.
volatile: do not remove or merge this access
atomic: other execution flows cannot observe half of this operation
barrier: memory accesses across the barrier cannot be freely reordered
lock: only one execution flow enters the critical section at a time
They overlap, but they do not replace one another.
Both Compilers and CPUs Can Change Order
Source order is not the only order other execution flows may observe.
The compiler may reorder independent accesses, keep variables in registers, merge repeated reads or writes, and delete accesses it believes have no visible effect.
The CPU may also execute out of order, use store buffers, prefetch loads, or delay writes so another CPU or device observes a different order than the source code suggests.
For example:
data = 42;
ready = 1;
The programmer may expect another thread that sees ready == 1 to also see data == 42. Without synchronization, neither the compiler nor the CPU necessarily provides that cross-thread guarantee.
Concurrent code cannot build communication protocols only on “I wrote data before ready in the source.”
What volatile Actually Guarantees
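The core purpose of volatile is to tell the compiler that the object’s value may change, or that accessing it may have side effects, in ways the compiler cannot see. Every read and write in the source must therefore actually be performed, not elided, merged, or cached in a register.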
It is useful for cases such as:
- memory-mapped registers
- interaction with signal handlers or special runtime environments
- preventing removal of externally visible accesses
- polling hardware status registers in bare-metal code
Example:
volatile uint32_t *status = (volatile uint32_t *)STATUS_REG;
while ((*status & READY_BIT) == 0) {
    /* spin: the volatile read is repeated on every iteration */
}
Without volatile, the compiler might assume *status does not change inside the loop and read it only once.
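The signal-handler case from the list above can be sketched in standard C. This is a minimal illustration, not a complete shutdown mechanism; the names stop_flag, on_sigint, install_stop_handler, and stop_requested are hypothetical and do not come from any particular codebase.

```c
#include <signal.h>
#include <stdbool.h>

/* volatile sig_atomic_t is the object type the C standard lets a
 * signal handler write while the interrupted code reads it. */
static volatile sig_atomic_t stop_flag = 0;

/* Hypothetical handler: it does nothing except set the flag. */
static void on_sigint(int signo)
{
    (void)signo;
    stop_flag = 1;
}

/* Install the handler; returns false if signal() reports an error. */
bool install_stop_handler(void)
{
    return signal(SIGINT, on_sigint) != SIG_ERR;
}

/* Because the flag is volatile, every call performs a real read, so a
 * polling loop cannot keep a stale copy in a register. */
bool stop_requested(void)
{
    return stop_flag != 0;
}
```

A main loop would call install_stop_handler() once and then test stop_requested() on each iteration. This covers a handler and the code it interrupts; it is not a way to share the flag between threads.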
But volatile does not guarantee:
- multithreaded mutual exclusion
- read-modify-write atomicity
- ordering against non-volatile accesses, or any ordering enforced at the CPU level
- CPU cache coherence
- happens-before between threads
- safe shared data structures between tasks and interrupts
So using volatile int flag for thread synchronization is usually insufficient.
x++ Is Not Atomic
Many operations that look like one statement are multiple machine-level steps.
counter++;
It usually involves:
read counter
add 1
write counter back
If two threads execute this at the same time, both may read the old value and both write back the same new value, losing one increment.
Declaring counter as volatile does not fix this. It forces each read and write to be performed, but it does not make the read-add-write sequence indivisible.
Atomic operations solve this class of problem. They make an operation appear indivisible to other execution flows, or provide explicit compare-and-swap, fetch-add, exchange, and similar semantics.
Atomicity answers whether an operation can be observed halfway or whether concurrent updates can be lost. It is not the same as mutual exclusion for a whole critical section.
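As a rough sketch of the fetch-add and compare-and-swap operations mentioned above, the following uses C11 <stdatomic.h>. The counter event_count and the helper names are hypothetical, and relaxed ordering is chosen on the assumption that the counter is pure statistics and publishes no other data.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared counter; static storage starts at zero. */
static atomic_uint event_count;

/* One indivisible read-modify-write: concurrent callers cannot
 * overwrite each other's increment. */
void count_event(void)
{
    atomic_fetch_add_explicit(&event_count, 1u, memory_order_relaxed);
}

unsigned int events_seen(void)
{
    return atomic_load_explicit(&event_count, memory_order_relaxed);
}

/* Compare-and-swap loop: increment only while the value is below
 * limit. Returns false once the limit has been reached. */
bool count_event_capped(unsigned int limit)
{
    unsigned int seen = atomic_load_explicit(&event_count, memory_order_relaxed);
    while (seen < limit) {
        /* On failure, seen is reloaded with the current value and the
         * loop condition is checked again before retrying. */
        if (atomic_compare_exchange_weak_explicit(&event_count, &seen, seen + 1u,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
            return true;
        }
    }
    return false;
}
```

Each function protects only its own single operation. If two counters had to change together, these calls alone would not keep them consistent; that is the job of a lock.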
Atomics Also Have Memory Order
In many languages and kernel APIs, atomic operations carry memory-order semantics.
Common semantics include:
- relaxed: only atomicity for this variable, no extra ordering
- acquire: later reads and writes cannot move before this operation
- release: earlier reads and writes cannot move after this operation
- acquire-release: both acquire and release semantics
- sequentially consistent: acquire-release semantics plus a single total order over all sequentially consistent operations that every thread agrees on
A typical data publication model is:
producer writes data
producer release-writes ready
consumer acquire-reads ready
consumer reads data
The key is not only that ready is atomic. Release/acquire establishes ordering between threads: once the consumer’s acquire read observes the value stored by the producer’s release write, it is also guaranteed to see the data written before that release.
A relaxed atomic counter may be fine for statistics, but not for publishing a complex object.
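The publication model above might look like this in C11. It is a minimal sketch assuming one producer and one consumer; payload, ready, publish, and try_consume are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;         /* plain data, published through ready */
static atomic_bool ready;   /* static storage starts as false */

/* Producer: write the data first, then release-store the flag.
 * The release ordering keeps the payload write from moving after it. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: acquire-load the flag; read the data only if it is set.
 * The acquire ordering keeps the payload read from moving before it. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}
```

If both operations on ready were memory_order_relaxed, the flag itself would still be atomic, but the consumer would have no guarantee about what it reads from payload.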
Memory Barriers Control Order
A memory barrier, or fence, constrains ordering and visibility of accesses around the barrier.
Roughly:
- read barriers constrain read ordering
- write barriers constrain write ordering
- full barriers constrain read and write ordering
- compiler barriers constrain compiler reordering
- CPU barriers constrain hardware observation order
For example, when a producer writes data and then notifies a device or another CPU:
write descriptor contents
write memory barrier
write doorbell register
The barrier ensures that by the time the other side observes the doorbell write, the descriptor contents are already visible to it.
Without the barrier, the other side may see the notification before seeing the updated descriptor.
A memory barrier is not a lock. It does not prevent two threads from entering the same code, and it does not protect a complex data structure from concurrent modification. It only constrains ordering.
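For the CPU-to-CPU case, the descriptor-then-doorbell sequence can be sketched with C11 fences. The structure and function names below are hypothetical. The sketch also assumes the other side is another thread: notifying a real device generally needs the platform's own barrier or register accessor (for example wmb() or writel() in Linux), because C11 fences only define ordering between threads.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct descriptor {
    uint32_t addr;
    uint32_t len;
};

static struct descriptor desc;   /* plain memory */
static atomic_uint doorbell;     /* 0 = idle, 1 = rung */

/* Producer: fill the descriptor, fence, then ring the doorbell. */
void submit_descriptor(uint32_t addr, uint32_t len)
{
    desc.addr = addr;
    desc.len = len;
    /* Release fence: the descriptor writes above are ordered before
     * the doorbell store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&doorbell, 1u, memory_order_relaxed);
}

/* Consumer: check the doorbell, fence, then read the descriptor. */
bool take_descriptor(struct descriptor *out)
{
    if (atomic_load_explicit(&doorbell, memory_order_relaxed) == 0u) {
        return false;
    }
    /* Acquire fence: the descriptor reads below are ordered after
     * the doorbell load above. */
    atomic_thread_fence(memory_order_acquire);
    *out = desc;
    return true;
}
```

The fences order the accesses but exclude nobody. Two producers calling submit_descriptor() at the same time would still race on desc.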
Locks Usually Include Ordering Guarantees
In ordinary application code, the right tool is often not a hand-written barrier, but a lock, mutex, semaphore, condition variable, or queue.
Locks provide mutual exclusion and usually memory-ordering guarantees.
thread A:
lock
modify shared data
unlock
thread B:
lock
read shared data
unlock
Once thread B acquires the same lock, it sees every modification thread A made before unlocking.
That is why locks are the more common and easier-to-use synchronization tool. They may be implemented with atomics and barriers underneath, but callers do not need to write them everywhere.
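A small sketch of the lock pattern, assuming a POSIX environment with pthread mutexes. The two-field structure and its invariant are hypothetical, chosen to show that the lock protects the fields as a group. Return values from the lock calls are ignored for brevity.

```c
#include <pthread.h>

/* Hypothetical shared state. Invariant: a + b never changes. */
struct balances {
    long a;
    long b;
};

static struct balances shared = { 100, 0 };
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: both fields change inside one critical section, so no
 * reader holding the lock can see one updated and the other not. */
void transfer(long amount)
{
    pthread_mutex_lock(&shared_lock);
    shared.a -= amount;
    shared.b += amount;
    pthread_mutex_unlock(&shared_lock);
}

/* Reader: taking the same lock gives both mutual exclusion and
 * visibility of everything written before the previous unlock. */
long total(void)
{
    pthread_mutex_lock(&shared_lock);
    long sum = shared.a + shared.b;
    pthread_mutex_unlock(&shared_lock);
    return sum;
}
```

No explicit barrier or atomic appears in the caller's code; the mutex implementation supplies the ordering.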
Direct atomics and barriers are more common when locks are too heavy, lock-free data structures are needed, hardware ordering matters, low-level cross-CPU communication is implemented, or interrupt context restricts blocking.
Interrupts and Tasks Still Need Synchronization
A single-core MCU does not have multicore cache-coherence issues, but it still has concurrency.
An interrupt can preempt a task. If a task and ISR share variables without synchronization, bugs still happen:
- task reads or writes a multi-byte value and is interrupted halfway
- ISR modifies queue head/tail while task also modifies them
- compiler keeps a flag in a register inside a task loop
- task clears a flag while ISR sets it, losing an event
volatile can make the task reload a hardware or ISR-updated flag, but it does not protect complex data structures.
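The lost-event case in the list above (the task clears a flag while the ISR sets it) can be avoided by making both sides use one atomic read-modify-write. The sketch below uses C11 atomics with hypothetical event bits and function names. How the toolchain lowers these operations, and whether the result is interrupt-safe, depends on the target. On cores without atomic read-modify-write instructions, briefly disabling interrupts may be needed instead.

```c
#include <stdatomic.h>

#define EVT_RX (1u << 0)   /* hypothetical event bits */
#define EVT_TX (1u << 1)

static atomic_uint pending_events;   /* static storage starts at zero */

/* ISR side: set bits without disturbing bits that are already set. */
void isr_post_event(unsigned int bits)
{
    atomic_fetch_or_explicit(&pending_events, bits, memory_order_release);
}

/* Task side: fetch and clear in a single step. An event posted just
 * after the exchange stays pending for the next call; it is not
 * wiped out by a separate clear. */
unsigned int task_take_events(void)
{
    return atomic_exchange_explicit(&pending_events, 0u, memory_order_acquire);
}
```

The task then handles every bit in the returned mask. A separate "read flags, then write zero" sequence would leave a window in which a newly posted event is erased.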
Common approaches include:
- briefly disabling interrupts around critical sections
- using atomic bit operations
- using RTOS ISR-safe queues or semaphores
- using a single-producer single-consumer ring buffer with explicit memory ordering
- using platform-provided barriers around shared state
So “single core means no synchronization” is also wrong. Single core removes some problems, not concurrency itself.
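One of the approaches listed above, a single-producer single-consumer ring buffer with explicit memory ordering, can be sketched as follows. It assumes exactly one producer (for example an ISR) and exactly one consumer (for example a task). The names and the buffer size are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u   /* power of two, so wrapping is a simple mask */

static uint8_t ring_data[RING_SIZE];
static atomic_uint ring_head;   /* written only by the producer */
static atomic_uint ring_tail;   /* written only by the consumer */

/* Producer side. Returns false when the ring is full. */
bool ring_push(uint8_t byte)
{
    unsigned int head = atomic_load_explicit(&ring_head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&ring_tail, memory_order_acquire);
    if (head - tail == RING_SIZE) {
        return false;
    }
    ring_data[head & (RING_SIZE - 1u)] = byte;
    /* Release: the slot write above is ordered before the new head. */
    atomic_store_explicit(&ring_head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(uint8_t *out)
{
    unsigned int tail = atomic_load_explicit(&ring_tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&ring_head, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *out = ring_data[tail & (RING_SIZE - 1u)];
    /* Release: the slot read above is ordered before the slot is
     * handed back to the producer. */
    atomic_store_explicit(&ring_tail, tail + 1u, memory_order_release);
    return true;
}
```

Each index has a single writer, which is why no lock is needed. With a second producer or a second consumer this design is no longer safe.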
Device Registers Need Dedicated Semantics
Driver code that accesses MMIO registers sometimes uses volatile in register definitions, but modern kernels usually prefer platform access APIs such as Linux readl() and writel().
Device register access is not only “do not optimize this away.”
It also involves:
- access width
- endianness
- compiler reordering
- CPU-to-device ordering
- posted writes
- read/write combining
- bus semantics
- ordering relative to normal memory
For example, writing a DMA descriptor and then writing a device doorbell register requires the descriptor to be visible to the device before the notification.
So drivers should not treat device registers as ordinary volatile int * pointers. Platform APIs often include the required access semantics and barrier constraints.
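To illustrate what such accessors encapsulate, here is a simplified sketch that can be tried in user space against a fake register block. Everything in it is hypothetical: the register layout, the helper names, and in particular the use of a C11 fence as a stand-in for a platform write barrier. A real driver would use the platform's accessors (such as writel() in Linux), not this code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical register layout of an imaginary device. */
struct dev_regs {
    uint32_t status;
    uint32_t doorbell;
};

/* Hypothetical DMA descriptor living in normal memory. */
struct dma_desc {
    uint32_t buf_addr;
    uint32_t len;
};

/* Accessors pin the access width to exactly 32 bits and keep the
 * volatile qualifier in one place. */
static inline uint32_t reg_read32(const volatile uint32_t *reg)
{
    return *reg;
}

static inline void reg_write32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Fill the descriptor, order it before the notification, then notify. */
void dev_submit(volatile struct dev_regs *regs, struct dma_desc *desc,
                uint32_t buf_addr, uint32_t len)
{
    desc->buf_addr = buf_addr;
    desc->len = len;
    /* ASSUMPTION: stand-in for a platform write barrier. A C11 fence
     * does not by itself guarantee ordering as seen by a device. */
    atomic_thread_fence(memory_order_release);
    reg_write32(&regs->doorbell, 1u);
}
```

The point of the sketch is the shape: width, qualifier, and ordering live inside the accessor and the submit path, so callers never dereference a raw register pointer.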
How to Debug These Bugs
When a bug disappears after adding logs or turning optimization off, fails only on multicore, occasionally loses events, or shows up as a driver reading stale state, split the problem into layers.
First, decide whether this is a compiler optimization issue or a CPU/multicore visibility issue. volatile mainly addresses the former.
Second, decide whether atomicity is required. Can shared counters, state bits, or reference counts be modified by multiple execution flows?
Third, decide whether ordering is required. After observing a flag, must the reader observe data published before the flag?
Fourth, decide whether mutual exclusion is required. Do multiple fields form an invariant that must be protected together?
Fifth, check interrupt context. Can a task and ISR interleave on the same state?
Sixth, check whether device registers or DMA are involved. Normal memory, MMIO, DMA buffers, and cache maintenance are not the same layer.
These questions help decide whether to use volatile, atomic operations, barriers, locks, or driver/platform APIs.
What Matters in Practice
volatile, atomics, memory barriers, and locks are not different spellings of the same synchronization idea.
volatile mostly constrains compiler optimization of individual accesses. Atomic operations make specific operations indivisible and may carry memory order. Memory barriers constrain ordering and visibility. Locks provide mutual exclusion and usually include the necessary ordering guarantees.
For threaded, ISR, driver, and multicore code, first ask:
- who can access this data concurrently
- whether atomicity is needed
- whether ordering or visibility guarantees are needed
- whether multiple fields need one invariant protected
- whether the target is normal memory, a device register, or a DMA buffer
Answering those questions is far more reliable than adding volatile to every shared variable.