Operating Systems

22 Posts

Why Clock Trees Affect CPUs, Buses, and Peripherals

10 minute

Many peripheral problems do not look like clock problems at first.

UART baud rate is slightly wrong. SPI fails when the rate is raised. I2C occasionally times out. A timer period drifts. ADC sampling does not match the expected rate. The CPU runs at a high frequency, but peripheral register access is still slow. A more subtle case: everything works before sleep, but the first peripheral access after wakeup fails, or a UART works in the bootloader and then stops working once the operating system takes over.
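
One way a clock-tree change turns into a "slightly wrong" baud rate is divider quantization. The sketch below assumes a hypothetical UART whose baud rate is the peripheral clock divided by 16 times an integer divisor; the function names are illustrative and not taken from any vendor HAL.

```c
#include <stdint.h>

/* Hypothetical UART model: baud = pclk_hz / (16 * divisor), integer divisor. */
uint32_t uart_divisor(uint32_t pclk_hz, uint32_t baud)
{
    /* Round to nearest instead of truncating. */
    return (pclk_hz + 8u * baud) / (16u * baud);
}

/* Baud rate error in hundredths of a percent (positive means too fast). */
int32_t uart_error_bp(uint32_t pclk_hz, uint32_t baud)
{
    uint32_t div = uart_divisor(pclk_hz, baud);
    if (div == 0)
        div = 1;                    /* clock too slow for this baud rate */

    int64_t actual = pclk_hz / (16u * div);
    return (int32_t)((actual - (int64_t)baud) * 10000 / (int64_t)baud);
}
```

Under this model, a 48 MHz peripheral clock reaches 115200 baud within about 0.15 %, while an 8 MHz clock is off by roughly 8.5 %, which is enough to corrupt frames even though the driver code is unchanged.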


Why Device Upgrade and Rollback Are System Engineering

8 minute

The worst remote-device upgrade failure is not “the upgrade failed.” It is “the upgrade failed and the device never comes back.”

Many IoT devices are deployed in the field. Engineers cannot easily open them, flash them manually, or attach a serial console. If one OTA update corrupts the boot slot, migrates configuration irreversibly, or switches to a new system that cannot connect back to confirm success, the device may become unrecoverable remotely.
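
A common defence against this failure mode is a trial boot with automatic fallback. The sketch below is a minimal model of that idea, assuming two boot slots and a small metadata record the bootloader can read and write; the struct layout and function names are hypothetical.

```c
#include <stdint.h>

enum { NO_PENDING = 0xFF };

/* Hypothetical boot metadata kept outside both firmware slots. */
struct boot_state {
    uint8_t active_slot;   /* last slot confirmed good: 0 or 1 */
    uint8_t pending_slot;  /* slot under trial after an update, or NO_PENDING */
    uint8_t tries_left;    /* trial boots remaining for pending_slot */
};

/* Bootloader, every boot: decide which slot to start. */
uint8_t boot_select_slot(struct boot_state *st)
{
    if (st->pending_slot != NO_PENDING) {
        if (st->tries_left > 0) {
            st->tries_left--;          /* spend an attempt before jumping */
            return st->pending_slot;
        }
        st->pending_slot = NO_PENDING; /* trial exhausted: roll back */
    }
    return st->active_slot;
}

/* New system, only after it has proven itself (for example, after it has
 * reconnected to the backend): make the trial slot permanent. */
void boot_confirm(struct boot_state *st)
{
    if (st->pending_slot != NO_PENDING) {
        st->active_slot = st->pending_slot;
        st->pending_slot = NO_PENDING;
        st->tries_left = 0;
    }
}
```

The design choice that matters is that confirmation comes from the new system, not from the updater: a slot that boots but cannot connect back never calls boot_confirm(), so the device returns to the old slot on its own.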


How Memory Barriers, Atomics, and volatile Differ

7 minute

Some embedded and systems bugs are hard to reproduce: adding a log makes them disappear, disabling optimization hides them, changing CPU architecture exposes them, or multicore load makes them fail occasionally.

The first reaction is often to add volatile.

But volatile is not a thread synchronization primitive. It is not a memory barrier, and it is not a lock. It stops the compiler from removing or caching accesses to the object, but it does not guarantee ordering or visibility across cores, and it does not turn x++ into an atomic operation.
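
For comparison, here is a minimal sketch of what C11 atomics provide instead, assuming a compiler with <stdatomic.h>. The variable and function names are illustrative; the pattern is a release store paired with an acquire load, plus an atomic increment.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;              /* plain data, published through the flag */
static atomic_bool ready;        /* synchronization flag */
static atomic_uint event_count;  /* counter shared between threads */

/* Producer: write the data, then release-store the flag. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: acquire-load the flag. If it is set, the earlier write to
 * payload is guaranteed to be visible to this thread. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}

/* Atomic read-modify-write: concurrent callers never lose an increment.
 * Returns the value after the increment. */
unsigned count_event(void)
{
    return atomic_fetch_add_explicit(&event_count, 1u,
                                     memory_order_relaxed) + 1u;
}
```

Declaring ready as volatile bool would keep the compiler from caching it, but it would not order the payload write against the flag write on a weakly ordered CPU.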


Why I/O Multiplexing and Event Loops Are Common

8 minute

Many Linux services eventually become event loops: network sockets, pipes, timerfd, eventfd, device nodes, and signal notifications all enter a select, poll, or epoll loop.

This is not because epoll is a more advanced read(). It solves a more basic problem: one thread cannot block on many read() calls at the same time.

If a program has only one socket, blocking read() is natural. Once it must handle hundreds of connections, one control pipe, several timers, and a device fd, it cannot let the thread get stuck on any single object.
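
A minimal sketch of the alternative, using POSIX poll(): the thread blocks once on the whole set and then reads only from a descriptor that is ready. The helper names and the 64-descriptor limit are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

#define MAX_WATCHED 64

/* Block until any descriptor is readable or hung up.
 * Returns its index, or -1 on timeout or error. */
int wait_any_readable(const int *fds, size_t n, int timeout_ms)
{
    struct pollfd pfds[MAX_WATCHED];

    if (n > MAX_WATCHED)
        return -1;
    for (size_t i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    if (poll(pfds, (nfds_t)n, timeout_ms) <= 0)
        return -1;                    /* 0 = timeout, negative = error */

    for (size_t i = 0; i < n; i++)
        if (pfds[i].revents & (POLLIN | POLLHUP))
            return (int)i;
    return -1;
}

/* One loop iteration: wait on the set, then read from the ready descriptor. */
ssize_t serve_once(const int *fds, size_t n, int timeout_ms,
                   char *buf, size_t len, int *which)
{
    int i = wait_any_readable(fds, n, timeout_ms);
    if (i < 0)
        return -1;
    *which = i;
    return read(fds[i], buf, len);    /* does not block: data is waiting */
}
```

A real event loop would handle every ready descriptor per wakeup rather than only the first, and would usually switch to epoll once the set grows large.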


What Crash Evidence and Logs Should Preserve

8 minute

The hardest field failures are often not “the device crashed,” but “the device crashed, rebooted, and left nothing useful behind.”

After reboot, everything may look normal. Services restart, the network reconnects, logs begin from the new boot. Users only know the device was offline. Engineers have to guess: application crash, kernel panic, watchdog reset, power loss, voltage dip, or an external MCU resetting the main processor?

The goal of crash evidence is not to save every log line. Useful evidence should be small enough, reliable enough, and specific enough to let the next boot identify the failure type, failure location, system state, and recovery path.
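
As a rough illustration of "small and reliable", the sketch below models a fixed-size crash record that the failing system fills in and the next boot validates before trusting. The field set, the magic value, and the XOR checksum are assumptions chosen for brevity; a real design would more likely use a CRC and memory known to survive a warm reset.

```c
#include <stdbool.h>
#include <stdint.h>

#define CRASH_MAGIC 0x43524153u   /* arbitrary marker chosen for this sketch */

enum crash_kind { CRASH_NONE, CRASH_APP, CRASH_PANIC, CRASH_WATCHDOG };

/* Hypothetical record: failure type, failure location, coarse system state. */
struct crash_record {
    uint32_t magic;
    uint32_t kind;       /* enum crash_kind */
    uint32_t pc;         /* program counter at the point of failure */
    uint32_t uptime_s;   /* seconds since boot */
    uint32_t checksum;   /* covers the four fields above */
};

static uint32_t crash_sum(const struct crash_record *r)
{
    return r->magic ^ r->kind ^ r->pc ^ r->uptime_s ^ 0xA5A5A5A5u;
}

/* Failing system: fill the record, writing the checksum last. */
void crash_record_save(struct crash_record *r, enum crash_kind kind,
                       uint32_t pc, uint32_t uptime_s)
{
    r->magic = CRASH_MAGIC;
    r->kind = (uint32_t)kind;
    r->pc = pc;
    r->uptime_s = uptime_s;
    r->checksum = crash_sum(r);
}

/* Next boot: accept the record only if it is intact. */
bool crash_record_valid(const struct crash_record *r)
{
    return r->magic == CRASH_MAGIC && r->checksum == crash_sum(r);
}
```

A record that fails validation is itself evidence: together with the hardware reset-reason register, it points toward power loss or a reset that gave software no chance to run.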


What Happens Behind Device Sleep and Wakeup

8 minute

IoT devices often need to save power. When the screen is off, the network is idle, or sensor sampling is infrequent, the system wants to enter a low-power state.

From the application point of view, sleep can look like “pause for a while and continue when an event arrives.” Inside the system, much more happens.

Device sleep is not just pausing the CPU. The system may stop CPU cores, lower frequencies, gate peripheral clocks, cut power domains, save register state, freeze user processes, run driver suspend callbacks, and leave only a small set of wake sources enabled.


Why a Watchdog Is More Than Rebooting a Frozen System

8 minute

Many devices have a watchdog. The common explanation is simple: if the system freezes, the watchdog times out and reboots it.

That is not wrong, but it is too shallow.

A useful watchdog is not merely a timed reset mechanism. It asks a more specific question: are the critical paths that must keep making progress actually still making progress?

If the watchdog is fed from the wrong place, the business thread may be deadlocked while the system still feeds the watchdog on time forever. If the timeout is too short for real scheduling and I/O behavior, the system may reset even though it could have recovered normally.
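
One way to avoid feeding from the wrong place is to make the feed conditional on progress reports from every critical task. The sketch below assumes three hypothetical tasks and leaves out the hardware access itself; wd_should_feed() only decides whether the supervisor is allowed to feed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical set of tasks that must keep making progress. */
enum { TASK_NET, TASK_SENSOR, TASK_STORAGE, TASK_COUNT };

static atomic_uint alive_mask;   /* one bit per task that reported progress */

/* Called by a task after it completes a unit of real work,
 * not merely because its thread was scheduled. */
void wd_checkin(unsigned task)
{
    atomic_fetch_or(&alive_mask, 1u << task);
}

/* Called periodically by the supervisor. Returns true only if every
 * task has checked in since the last successful feed. */
bool wd_should_feed(void)
{
    const unsigned all = (1u << TASK_COUNT) - 1u;

    if ((atomic_load(&alive_mask) & all) != all)
        return false;                     /* someone is stuck: let it expire */

    atomic_fetch_and(&alive_mask, ~all);  /* demand fresh proof next window */
    return true;
}
```

With this shape, the hardware timeout has to cover the slowest legitimate check-in interval plus the supervisor period; otherwise a healthy but busy system is reset anyway.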


Why Page Cache Makes File I/O Different From Direct Disk Access

8 minute

Linux devices often show behavior that looks odd at first: a large file is much faster to read the second time; write() returns quickly, before the data has reached storage; free shows very little free memory even though nothing is leaking.

The same mechanism is often behind all of these observations: Page Cache.

Page Cache is the kernel layer that caches file contents in memory. It prevents file reads and writes from touching much slower storage on every operation. On reads, the kernel can check the cache first. On writes, the kernel can update cached pages first and write them back to storage later.
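
Because write() normally completes once the data is in Page Cache, code that needs the data on storage has to ask for that explicitly. A minimal sketch using POSIX open(), write(), and fsync() follows; the helper name write_durable is made up for this example, and EINTR handling is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes to path and wait until the kernel reports them flushed.
 * Returns 0 on success, -1 on failure. */
int write_durable(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* usually lands in Page Cache only */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    int rc = fsync(fd);                  /* push dirty pages out to storage */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

The fsync() call is where the latency of the storage device becomes visible. Making a newly created file's directory entry durable additionally requires an fsync() on the containing directory.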


How Stack, Heap, and Memory Layout Divide the Work

8 minute

When a program crashes, logs may mention stack overflow, segmentation fault, out of memory, or heap corruption. These are all memory-related, but they are not the same kind of failure.

Stack, heap, globals, text, and mmap regions are not merely “different memory blocks.” They serve different lifetimes, access patterns, and runtime constraints.

A useful first model is: text stores instructions, data stores global state, the stack stores function calls and local execution state, the heap stores dynamically allocated objects, and mmap regions store file mappings, shared memory, large anonymous mappings, and dynamic libraries.
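
The model can be checked on a running process by sampling one address from each region. The sketch below assumes a hosted C environment; the struct and function names are illustrative, and the exact addresses vary from run to run when address space layout randomization is enabled.

```c
#include <stdint.h>
#include <stdlib.h>

int initialized_global = 1;          /* lives in the data segment */

struct layout_sample {
    uintptr_t text, data, stack, heap;
};

/* Collect one representative address per region. Returns 0 on success. */
int sample_layout(struct layout_sample *out)
{
    int local = 0;                   /* lives in this call's stack frame */
    void *dyn = malloc(16);          /* small request, normally from the heap */

    if (dyn == NULL)
        return -1;

    out->text  = (uintptr_t)&sample_layout;
    out->data  = (uintptr_t)&initialized_global;
    out->stack = (uintptr_t)&local;
    out->heap  = (uintptr_t)dyn;

    free(dyn);
    return 0;
}
```

On Linux, printing the four values and comparing them with /proc/self/maps shows which mapping each address falls into.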


Why Timers and Clocks Affect Timeout Behavior

7 minute

Engineering code often contains calls like these:

wait_event_timeout(..., 1000);
sleep(1);
select(fd + 1, &rfds, NULL, NULL, &tv);

They all look like “wait for a while.” But waiting inside an operating system does not mean the CPU sits still and counts time.

A timeout usually passes through several steps: the program submits a wait request, the kernel places the current thread on a wait queue, a timer records the latest wakeup time, the scheduler gives the CPU to another thread, hardware clocks or timer interrupts advance time, and the thread is woken when the condition is satisfied or the timeout expires.
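
From user space, those steps are hidden behind a single call. The sketch below wraps the select() form shown above and measures the wait with CLOCK_MONOTONIC, which is not affected by wall-clock adjustments. The helper names are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds on the monotonic clock. */
long long monotonic_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Sleep in the kernel until fd is readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable_ms(int fd, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    return rc > 0 ? 1 : rc;
}

/* Read one byte, giving up after timeout_ms. Returns 1, 0, or -1. */
int read_byte_timeout(int fd, long timeout_ms, unsigned char *out)
{
    int rc = wait_readable_ms(fd, timeout_ms);
    if (rc <= 0)
        return rc;
    return read(fd, out, 1) == 1 ? 1 : -1;
}
```

While the caller is inside select(), it consumes no CPU time: the thread sits on a wait queue, and either the descriptor becoming readable or the timer expiring makes it runnable again.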
