Why a Watchdog Is More Than Rebooting a Frozen System

Many devices have a watchdog. The common explanation is simple: if the system freezes, the watchdog times out and reboots it.

That is not wrong, but it is too shallow.

A useful watchdog is not merely a timed reset mechanism. It asks a more specific question: are the critical paths that must keep making progress actually still making progress?

If the watchdog is fed from the wrong place, the business thread may be deadlocked while the system still feeds the watchdog on time forever. If the timeout is too short for real scheduling and I/O behavior, the system may reset even though it could have recovered normally.

A useful first model is: a watchdog is an independent timer, and the system must prove health within a defined window; if that proof fails, the watchdog triggers reset, interrupt, or another recovery action.

critical tasks run
-> update health state
-> watchdog manager checks state
-> feed hardware watchdog within window
-> feeding stops or is invalid
-> watchdog triggers reset or alarm

The hard part is not writing a register to feed the watchdog. The hard part is defining what healthy means.

Hardware and Software Watchdogs Have Different Jobs

A hardware watchdog is usually an independent timer inside the chip or in external circuitry. Once enabled, if software does not refresh it within the allowed time, it can trigger a reset, an NMI, an interrupt, or a power-control action.

Its value is independence. Even if the CPU runs away, the kernel deadlocks, or the scheduler stops working, the hardware can still recover the system if feeding stops.

A software watchdog runs inside the system and can check tasks, threads, services, event loops, or business state. It can make finer judgments, such as:

whether a thread has missed heartbeats for too long
whether an event queue keeps growing
whether a state machine is stuck in one state
whether network, storage, or driver requests never complete
whether a high-priority task starves lower-priority work

A software watchdog understands business state better, but depends on the system still running. A hardware watchdog is coarser, but more independent.

Reliable devices often combine them: software checks system health, and only a controlled path feeds the hardware watchdog.

Feeding Location Determines What Can Be Detected

If a timer interrupt feeds the watchdog unconditionally, it only proves that the timer interrupt still runs. It does not prove task scheduling, application logic, filesystem progress, or networking is healthy.

If the main loop feeds it at the end of each iteration, it proves that the main loop reaches that point. It may say nothing about other tasks.

If a background thread feeds it on a fixed period, it only proves that this thread gets scheduled.

So kick_watchdog() is not a call to drop in wherever it is convenient. Where it is called defines what the watchdog can detect.

A more robust design lets critical paths report health separately, then a watchdog manager aggregates them:

communication task heartbeat
storage task heartbeat
control task heartbeat
main state-machine progress counter
-> watchdog manager checks
-> feed hardware watchdog only if all conditions pass

The watchdog then checks not “some code is still running,” but “the critical parts of the system are still making progress.”
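
The aggregation described above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not a reference implementation: the class name WatchdogManager, the task names, and the per-task age limits are all hypothetical, and the hardware feed is represented by a counter rather than a register write. Time is passed in explicitly so the logic can be exercised without a real clock.

```python
class WatchdogManager:
    """Feeds the (simulated) hardware watchdog only while every critical task is fresh."""

    def __init__(self, max_age_s):
        self.max_age_s = dict(max_age_s)  # task name -> allowed heartbeat age in seconds
        self.last_seen = {}               # task name -> time of last heartbeat
        self.feeds = 0                    # stand-in for writes to the watchdog peripheral

    def heartbeat(self, task, now):
        """Called by a critical task after it completes a unit of work."""
        self.last_seen[task] = now

    def stale_tasks(self, now):
        """Tasks that never reported, or whose heartbeat is older than allowed."""
        return sorted(
            task for task, limit in self.max_age_s.items()
            if task not in self.last_seen or now - self.last_seen[task] > limit
        )

    def poll(self, now):
        """Feed only if all conditions pass; otherwise let the hardware timer expire."""
        if self.stale_tasks(now):
            return False
        self.feeds += 1  # real code would refresh the hardware watchdog here
        return True


# Hypothetical limits: the control task must report far more often than storage.
mgr = WatchdogManager({"comm": 2.0, "storage": 5.0, "control": 0.5})
for task in ("comm", "storage", "control"):
    mgr.heartbeat(task, now=10.0)

print(mgr.poll(now=10.25))        # every heartbeat is 0.25 s old, so the feed happens
print(mgr.stale_tasks(now=11.0))  # the control heartbeat is now 1.0 s old, limit 0.5 s
```

One design choice worth noting: a task that has never reported counts as stale, so the manager does not feed until every critical path has proven itself at least once. Whether that is appropriate during boot depends on the phase policy in use.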

A Heartbeat Is Not Health

Many systems ask tasks to update a periodic heartbeat. If the watchdog manager sees the heartbeat change, it considers the task healthy.

This is better than unconditional feeding, but still incomplete.

A task can update its heartbeat while the business logic is stuck:

a thread loops while retrying the same failure
a state machine keeps oscillating between bad states
an event loop is alive while the queue backlog grows
the network thread is alive while every request times out
the control task is alive while sensor data has stopped updating

Health checks should not only ask whether a thread moved. They should ask whether critical state advanced.

Useful signals include:

whether a progress counter changes
whether queue length stays above a threshold
how long since the last successful I/O
whether a state machine remains in an abnormal state too long
whether a control loop misses its deadline
whether recovery keeps failing

A watchdog should detect unrecoverable stagnation, not merely confirm that a loop is spinning.
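
To make the difference between "the thread moved" and "critical state advanced" concrete, here is a small sketch that evaluates three of the signals listed above. The Snapshot fields, the function name stalled_reasons, and both thresholds are assumptions chosen for illustration; a real system would pick its own signals and limits.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    progress: int      # incremented only when useful work completes, not on every loop
    queue_len: int     # current backlog of the task's input queue
    last_io_ok: float  # timestamp of the last successful I/O, in seconds


def stalled_reasons(prev, cur, now, max_queue=100, max_io_gap_s=30.0):
    """Compare two consecutive snapshots and list the reasons the task looks stalled."""
    reasons = []
    if cur.progress == prev.progress:
        reasons.append("no progress")
    if prev.queue_len > max_queue and cur.queue_len > max_queue:
        reasons.append("queue backlog")
    if now - cur.last_io_ok > max_io_gap_s:
        reasons.append("no successful I/O")
    return reasons


# The task is alive (it keeps producing snapshots) but nothing is advancing.
prev = Snapshot(progress=41, queue_len=120, last_io_ok=95.0)
cur = Snapshot(progress=41, queue_len=150, last_io_ok=95.0)
print(stalled_reasons(prev, cur, now=130.0))
```

A task stuck retrying the same failure would still produce snapshots, which is exactly why the check looks at the counter and the I/O timestamp instead of at the mere arrival of a report.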

Too Short a Timeout Causes False Resets

Shorter watchdog timeouts are not always better.

If the timeout is shorter than real worst-case latency, false resets appear. Influencing factors include:

boot-time initialization
flash erase and filesystem mount
network reconnect and certificate validation
high-priority CPU load
interrupt storms or long driver paths
low-power wakeup and clock recovery
OTA upgrade and data migration

For example, if a flash erase takes hundreds of milliseconds, the watchdog window is 100 ms, and the normal feeding path cannot run during the erase, the device will reset on every erase.

That is not simply “the watchdog is too sensitive.” It means the health window does not cover the real worst-case path.

Different phases often need different policies: boot, upgrade, normal operation, and low-power operation may require different timeout windows and feeding conditions.
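
One way to catch this class of mistake early is to compare measured worst-case blocking times against the timeout for each phase. The sketch below does that with a plain table. All numbers, phase names, path names, and the 1.5x safety margin are assumed values for illustration only; the 100 ms "normal" window and the 400 ms flash erase mirror the example above.

```python
# Illustrative per-phase watchdog timeouts in seconds (assumed values).
PHASE_TIMEOUT_S = {"boot": 30.0, "upgrade": 120.0, "normal": 0.1, "low_power": 10.0}


def uncovered_paths(phase, worst_case_block_s, margin=1.5):
    """Return paths whose worst-case blocking time, with margin, exceeds the phase timeout."""
    timeout = PHASE_TIMEOUT_S[phase]
    return sorted(
        name for name, blocked in worst_case_block_s.items()
        if blocked * margin > timeout
    )


# Hypothetical measurements of how long each path can keep the feeder from running.
measured = {"flash_erase": 0.4, "sensor_read": 0.002, "tls_handshake": 3.0}

print(uncovered_paths("normal", measured))  # paths that would cause false resets
print(uncovered_paths("boot", measured))    # the longer boot window covers all of them
```

If a path shows up as uncovered, the options are to lengthen the window for that phase, to break the long operation into steps that let the feeding path run, or to move the operation into a phase with a different policy.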

Too Long a Timeout Loses Value

A timeout that is too long also causes problems.

If a device waits 10 minutes before resetting after a real stall, user experience and remote recoverability suffer, and business losses grow. For control devices, staying in a failed state too long can create safety risk.

So watchdog design balances two goals:

avoid false resets during normal long paths
avoid leaving real failures in place too long

This usually starts by defining failure levels.

For example, a UI stall of 3 seconds may justify restarting one service, 60 seconds without network connectivity may trigger a reconnect, a control loop that misses its deadline may force a safe state, and a complete scheduling stop should be left to the hardware watchdog.

Not every anomaly should reboot the whole machine.
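
The failure levels from the example can be written down as an explicit mapping from anomaly to the smallest sufficient recovery action. This is a sketch, not a recommended policy: the Action names and the function choose_actions are hypothetical, and the thresholds are simply the ones used in the example above.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RECONNECT_NETWORK = "reconnect network"
    RESTART_UI_SERVICE = "restart UI service"
    ENTER_SAFE_STATE = "enter safe state"


def choose_actions(ui_stall_s, net_down_s, control_deadline_missed):
    """Map each observed anomaly to a targeted recovery action, most urgent first."""
    actions = []
    if control_deadline_missed:
        actions.append(Action.ENTER_SAFE_STATE)
    if ui_stall_s >= 3.0:
        actions.append(Action.RESTART_UI_SERVICE)
    if net_down_s >= 60.0:
        actions.append(Action.RECONNECT_NETWORK)
    return actions or [Action.NONE]


print(choose_actions(ui_stall_s=4.0, net_down_s=5.0, control_deadline_missed=False))
print(choose_actions(ui_stall_s=0.0, net_down_s=90.0, control_deadline_missed=True))
```

A full reboot is deliberately absent from this table. If scheduling has stopped completely, this code never runs, which is precisely the case the hardware watchdog exists to cover.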

A Window Watchdog Can Detect Feeding Too Early

Some hardware provides a window watchdog: feeding is only valid inside a defined time window. Feeding too late resets the system, and feeding too early also resets it.

This detects another class of failure. If a program runs away into a wrong fast loop that still executes the feed instruction, a conventional timeout-only watchdog may never time out.

A window watchdog requires feeding at a reasonable rhythm:

too early: error
inside window: allowed
too late: timeout

This improves detection, but makes timing stricter. Scheduling jitter, low-power modes, long I/O, and interrupt-disabled sections can all push feeding outside the window.

When using a window watchdog, the feeding task’s period, priority, and worst-case execution path must be clear.
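
The three outcomes can be simulated with a small model. This sketch assumes a window defined by two offsets from the last valid feed, expressed in integer milliseconds; the class name WindowWatchdog and the 50 ms and 100 ms values are hypothetical. Real hardware would reset the device on a violation, whereas the model only reports the verdict.

```python
class WindowWatchdog:
    """Simulated window watchdog: a feed is valid only inside [open_ms, timeout_ms]."""

    def __init__(self, open_ms, timeout_ms, start_ms=0):
        if not 0 <= open_ms < timeout_ms:
            raise ValueError("window must open before the timeout")
        self.open_ms = open_ms        # feeds earlier than this are rejected
        self.timeout_ms = timeout_ms  # feeds later than this are rejected
        self.last_feed_ms = start_ms

    def feed(self, now_ms):
        """Return the verdict for a feed attempt; only a valid feed restarts the window."""
        elapsed = now_ms - self.last_feed_ms
        if elapsed < self.open_ms:
            return "too early"
        if elapsed > self.timeout_ms:
            return "too late"
        self.last_feed_ms = now_ms
        return "ok"


wd = WindowWatchdog(open_ms=50, timeout_ms=100)
print(wd.feed(70))   # 70 ms after start: inside the window
print(wd.feed(80))   # only 10 ms after the last valid feed: a runaway fast loop
print(wd.feed(130))  # 60 ms after the last valid feed: inside the window again
print(wd.feed(300))  # 170 ms after the last valid feed: the feeder was starved
```

Running a recorded trace of feed timestamps through a model like this is a cheap way to see how much jitter the chosen window tolerates before committing to register values.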

Preserve Evidence Before and After Reset

If the watchdog resets the device and no evidence is kept, the reboot only says “something restarted.”

Useful systems keep enough state to diagnose the cause:

reset reason
watchdog trigger count
last feed time
last heartbeat for each task
key state-machine state
current error code or error counters
lightweight crash log
tail of a persistent ring log

Some hardware watchdogs do not leave enough time to write logs at the moment of reset, so low-cost state must be maintained continuously rather than collected only after timeout.

On Linux systems, hardware watchdogs can be combined with kernel logs, pstore/ramoops, systemd watchdog, service restarts, and boot-time recovery scripts. On RTOS or bare-metal systems, common tools include reset reasons, retained RAM, backup registers, and compact fault codes.
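
As an illustration of continuously maintained, low-cost state, the sketch below packs a few fields into a fixed-size record with a checksum, the kind of record that could live in retained RAM or backup registers. The layout, the field names, and the MAGIC value are assumptions made for this example. The checksum matters because retained memory may hold garbage after a cold power-up, and boot code should not trust a record it cannot validate.

```python
import struct
import zlib

MAGIC = 0x57444F47                # arbitrary marker (ASCII "WDOG"), assumed for this sketch
_BODY = struct.Struct("<IIHH")    # magic, last_feed_ms, stale_task_bitmap, error_code
_CRC = struct.Struct("<I")        # CRC-32 of the body


def pack_record(last_feed_ms, stale_task_bitmap, error_code):
    """Serialize the current health state; cheap enough to refresh on every feed."""
    body = _BODY.pack(MAGIC, last_feed_ms, stale_task_bitmap, error_code)
    return body + _CRC.pack(zlib.crc32(body))


def unpack_record(raw):
    """Validate and decode a record at boot; return None if it is missing or corrupt."""
    if len(raw) != _BODY.size + _CRC.size:
        return None
    body = raw[:_BODY.size]
    (crc,) = _CRC.unpack(raw[_BODY.size:])
    magic, last_feed_ms, bitmap, error_code = _BODY.unpack(body)
    if magic != MAGIC or crc != zlib.crc32(body):
        return None
    return {
        "last_feed_ms": last_feed_ms,
        "stale_task_bitmap": bitmap,
        "error_code": error_code,
    }


record = pack_record(last_feed_ms=123456, stale_task_bitmap=0b0100, error_code=17)
print(len(record))            # fixed 16-byte record
print(unpack_record(record))  # what boot code would read back after a reset
```

Because the record is refreshed during normal operation, it is already in place when the watchdog fires, even if the reset leaves no time to write anything.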

The Feeding Thread Can Lie

A common mistake is letting an independent low-priority thread feed the watchdog periodically without checking other tasks.

This misses many failures.

If a high-priority task spins forever, a low-priority feeding thread may starve, and the hardware watchdog resets. That failure can be detected.

But if the feeding thread has high priority, it may continue running while most business tasks are deadlocked, and it will keep feeding the watchdog. That failure is missed.

If feeding happens in an interrupt, task-level deadlocks may also be missed.

The feeding path should intentionally depend on critical system capabilities: scheduling, task progress, queue consumption, I/O completion, or business-state movement. It must not be so independent that it no longer represents real health.

Linux Has Multiple Watchdog Layers

Linux systems often contain several watchdog layers:

hardware watchdog device such as /dev/watchdog
kernel soft lockup and hard lockup detection
systemd watchdog for service heartbeats
application-specific health checks
external MCU or power-management chip supervising the main processor

Their coverage is different.

An application watchdog can judge business state, but cannot handle a fully stuck kernel. systemd can restart services, but may not recover a driver deadlock. A hardware watchdog can reset the machine, but does not know which service failed first. An external MCU is more independent, but its communication and false-positive policy must be designed carefully.

Reliable design is not just enabling /dev/watchdog; it connects service-level recovery, system-level reset, and evidence collection.
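
For the systemd layer, a service reports liveness by sending the datagram WATCHDOG=1 to the socket named in the NOTIFY_SOCKET environment variable, and systemd exposes the configured timeout in WATCHDOG_USEC. The sketch below implements that notification directly with the standard library, assuming a unit configured with WatchdogSec=. The helper names sd_notify, notify_if_healthy, and feed_interval_s are local to this example, and the health flag is assumed to come from an application-specific check.

```python
import os
import socket


def sd_notify(message):
    """Send one notification datagram to systemd; return False if not supervised."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    addr = b"\0" + path[1:].encode() if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
    return True


def notify_if_healthy(healthy):
    """Send the keep-alive only when the application's own health check passed."""
    return sd_notify("WATCHDOG=1") if healthy else False


def feed_interval_s(default=None):
    """Half of WATCHDOG_USEC, the keep-alive interval systemd's documentation recommends."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else default
```

The important part is notify_if_healthy: if the keep-alive is sent from a timer regardless of application state, systemd learns only that the timer still fires, which repeats the unconditional-feeding problem one layer up.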

How to Debug Watchdog Resets

When a device is occasionally reset by a watchdog, do not start by simply increasing the timeout.

Split the problem into layers.

First, identify the reset source: hardware watchdog, window watchdog, kernel lockup detector, systemd watchdog, or external power manager.

Second, find the last feeding path: feeding thread, interrupt, main loop, service heartbeat, or watchdog manager.

Third, identify which tasks stopped progressing before timeout. Check heartbeats, queues, state machines, last successful I/O, and scheduling state.

Fourth, check for normal long paths: boot, upgrade, flash erase, network connection, and low-power wakeup may exceed the window.

Fifth, check for false feeding. If business state is dead while the feeding path still runs, the health condition is too shallow.

Sixth, check whether evidence survives reset. Without reset reason and task state, the next analysis becomes guesswork.

These questions are closer to the cause than only asking “why did the watchdog reboot?”

What Matters in Practice

A watchdog is not for periodically rebooting a system. It is for moving the system into a recoverable path when it reaches unrecoverable stagnation.

The key design questions are not how to call the feed function, but:

which paths must be proven alive
what counts as healthy progress
which worst-case paths the timeout window must cover
what evidence survives reset
how service recovery, system reset, and external supervision divide responsibility

A casually fed watchdog only creates a feeling of safety. A well-designed watchdog should reset the system when it can no longer recover on its own, and should not reset it during normal slow paths.