Reliability

3 Posts

Why Device Upgrade and Rollback Are System Engineering

8 minute read

The worst remote-device upgrade failure is not “the upgrade failed.” It is “the upgrade failed and the device never comes back.”

Many IoT devices are deployed in the field. Engineers cannot easily open them, flash them manually, or attach a serial console. If one OTA update corrupts the boot slot, migrates configuration irreversibly, or switches to a new system that cannot connect back to confirm success, the device may become unrecoverable remotely.
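One common way to guard against this is a trial boot: the new slot gets a limited number of boot attempts and only becomes permanent once the new system connects back and confirms itself. The sketch below is a minimal illustration of that idea, not any particular bootloader's behavior. The names `BootState`, `start_update`, `choose_slot`, and `confirm` are hypothetical, and a real device would keep this state in storage the bootloader can read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    """Hypothetical persistent boot state shared by bootloader and OS."""
    active: str = "A"            # last slot confirmed as good
    trial: Optional[str] = None  # slot currently under trial, if any
    tries_left: int = 0          # remaining unconfirmed boot attempts


def start_update(state: BootState, new_slot: str, max_tries: int = 3) -> None:
    """Mark a freshly written slot as a trial; the active slot is untouched."""
    state.trial = new_slot
    state.tries_left = max_tries


def choose_slot(state: BootState) -> str:
    """Bootloader decision on every boot."""
    if state.trial is not None:
        if state.tries_left > 0:
            state.tries_left -= 1
            return state.trial
        # Trial attempts exhausted without confirmation: roll back.
        state.trial = None
    return state.active


def confirm(state: BootState) -> None:
    """Called by the new system only after it has reconnected successfully."""
    if state.trial is not None:
        state.active = state.trial
        state.trial = None
        state.tries_left = 0
```

The point of the design is that the old slot stays bootable until the new one has proven itself, so a silent failure to connect back ends in a rollback rather than an unreachable device.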


What Crash Evidence and Logs Should Preserve

8 minute read

The hardest field failures are often not “the device crashed,” but “the device crashed, rebooted, and left nothing useful behind.”

After reboot, everything may look normal. Services restart, the network reconnects, and logs begin from the new boot. Users only know the device was offline. Engineers have to guess: application crash, kernel panic, watchdog reset, power loss, voltage dip, or an external MCU resetting the main processor?

The goal of crash evidence is not to save every log line. Useful evidence should be small enough, reliable enough, and specific enough to let the next boot identify the failure type, failure location, system state, and recovery path.
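As an illustration of "small, reliable, and specific", the sketch below packs a crash record into a fixed 21-byte blob protected by a CRC, so the next boot can tell a valid record from a half-written one. The field layout, the reason codes, and the names `pack_record` and `unpack_record` are assumptions made for this example. Where the blob is stored (reserved RAM, flash, an external MCU) is left out.

```python
import struct
import zlib

# Hypothetical layout: magic, reason code, boot count, uptime (s), fault address.
_FMT = "<IBIII"
_SIZE = struct.calcsize(_FMT)  # 17 bytes, little-endian, no padding
MAGIC = 0x43525348

REASONS = {
    0: "unknown",
    1: "app_crash",
    2: "kernel_panic",
    3: "watchdog",
    4: "power_loss",
}


def pack_record(reason: int, boot_count: int, uptime_s: int, fault_addr: int) -> bytes:
    """Serialize one crash record and append a CRC-32 of the body."""
    body = struct.pack(_FMT, MAGIC, reason, boot_count, uptime_s, fault_addr)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_record(blob: bytes):
    """Return the decoded record, or None if the blob is torn or corrupt."""
    if len(blob) != _SIZE + 4:
        return None
    body = blob[:_SIZE]
    (crc,) = struct.unpack("<I", blob[_SIZE:])
    if zlib.crc32(body) != crc:
        return None
    magic, reason, boot_count, uptime_s, fault_addr = struct.unpack(_FMT, body)
    if magic != MAGIC:
        return None
    return {
        "reason": REASONS.get(reason, "unknown"),
        "boot_count": boot_count,
        "uptime_s": uptime_s,
        "fault_addr": fault_addr,
    }
```

A record this small can be written in a single short operation late in a failure path, and the CRC lets the next boot reject a partially written record instead of trusting it.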


Why a Watchdog Is More Than Rebooting a Frozen System

8 minute read

Many devices have a watchdog. The common explanation is simple: if the system freezes, the watchdog times out and reboots it.

That is not wrong, but it is too shallow.

A useful watchdog is not merely a timed reset mechanism. It asks a more specific question: are the critical paths that must keep making progress actually still making progress?

If the watchdog is fed from the wrong place, the business thread may be deadlocked while the system still feeds the watchdog on time forever. If the timeout is too short for real scheduling and I/O behavior, the system may reset even though it could have recovered normally.
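One way to tie feeding to real progress is a supervisor that feeds the hardware watchdog only while every critical task has checked in within its own deadline. The sketch below shows that idea under stated assumptions: `ProgressWatchdog`, the task names, and the deadlines are hypothetical, and the call that actually feeds the hardware is left to the caller.

```python
import time


class ProgressWatchdog:
    """Tracks check-ins from critical tasks, each with its own deadline."""

    def __init__(self, deadlines, clock=time.monotonic):
        # deadlines: mapping of task name -> max seconds allowed between check-ins
        self._deadlines = dict(deadlines)
        self._clock = clock
        now = clock()
        self._last = {name: now for name in self._deadlines}

    def checkin(self, name):
        """Called by a task each time it completes a unit of real work."""
        self._last[name] = self._clock()

    def stalled(self):
        """Return the names of tasks that have missed their deadline."""
        now = self._clock()
        return [
            name
            for name, limit in self._deadlines.items()
            if now - self._last[name] > limit
        ]

    def should_feed(self):
        """Feed the hardware watchdog only if nothing is stalled."""
        return not self.stalled()
```

In use, a supervisor loop would call `should_feed()` on each tick and feed the hardware watchdog only when it returns true. Before letting the reset happen, it could also record the output of `stalled()` as crash evidence. Per-task deadlines let a slow but healthy I/O path have a longer limit than a fast control loop, which addresses the too-short timeout problem described above.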
