Why Device Upgrade and Rollback Are System Engineering

The worst remote-device upgrade failure is not “the upgrade failed.” It is “the upgrade failed and the device never comes back.”

Many IoT devices are deployed in the field. Engineers cannot easily open them, flash them manually, or attach a serial console. If one OTA update corrupts the boot slot, migrates configuration irreversibly, or switches to a new system that cannot connect back to confirm success, the device may become unrecoverable remotely.

So device upgrade is not just downloading a package, overwriting files, and rebooting.

A useful first model: a reliable upgrade is a state machine spanning the application, filesystem, bootloader, partition layout, configuration migration, and health checks. Every step must leave a recoverable state after power loss, reboot, write failure, or a broken new version.

download update package
-> verify integrity and signature
-> write inactive slot or temporary area
-> mark next boot to new version
-> reboot into new version
-> new version passes health check
-> confirm upgrade success
-> on failure, roll back to the old version

The most important question is not “can the upgrade succeed,” but “does every failed step preserve a path back.”
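The flow above can be sketched as a small state machine. This is a minimal illustration, not a real updater: the state names, event names, and the `step` function are all hypothetical. The point it tries to show is that a failure before the boot switch simply abandons the attempt, while a failure after the switch must go through rollback.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    DOWNLOADED = auto()
    VERIFIED = auto()
    WRITTEN = auto()
    SWITCHED = auto()     # next boot points at the new version
    TRYING = auto()       # new version is running but not yet confirmed
    CONFIRMED = auto()
    ROLLED_BACK = auto()


# Hypothetical transition table: (state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "download_ok"): State.DOWNLOADED,
    (State.DOWNLOADED, "verify_ok"): State.VERIFIED,
    (State.VERIFIED, "write_ok"): State.WRITTEN,
    (State.WRITTEN, "switch_ok"): State.SWITCHED,
    (State.SWITCHED, "reboot"): State.TRYING,
    (State.TRYING, "health_ok"): State.CONFIRMED,
}

# In these states the old version is still what boots next.
PRE_SWITCH = {State.IDLE, State.DOWNLOADED, State.VERIFIED, State.WRITTEN}


def step(state, event):
    """Return the next state, or raise if the event is not allowed here."""
    if (state, event) in TRANSITIONS:
        return TRANSITIONS[(state, event)]
    if event == "fail":
        if state in PRE_SWITCH:
            return State.IDLE          # abandon the attempt, old version untouched
        if state in (State.SWITCHED, State.TRYING):
            return State.ROLLED_BACK   # the boot target changed, so it must be undone
    raise ValueError(f"no transition from {state.name} on {event!r}")
```

Keeping the transitions in one table makes it easier to ask, for each state, what happens if power is lost or the step fails.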

Why Overwriting the Current System Is Dangerous

The most obvious upgrade method is overwriting the running system with the new version.

This is risky.

During upgrade, many things can happen:

power loss
network interruption
storage write failure
corrupted package
filesystem out of space
processes still using old files
dependencies half-updated
reboot in the middle of the process

If the current system is overwritten halfway, the next boot may be neither old nor new. The bootloader may load the kernel, but rootfs is incomplete. The application may start, but libraries do not match. Configuration may be migrated, but the new program does not run.

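Reliable OTA usually avoids destroying the currently working system. It writes the new version to another slot, temporary area, or verifiable image first, switches the boot target only after the written image has been verified, and keeps the old system until the new one has booted and been confirmed.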

A/B Slots Preserve the Old Version

A/B partitioning is common in device upgrades.

The system has two slots:

slot A: currently running version
slot B: upgrade target

When the device boots from A, the updater writes the new version to B. After writing and verification, it only changes the boot flag so the bootloader tries B next time.

If B boots successfully and passes health checks, the system marks B as confirmed.
If B fails to boot, keeps rebooting, or never confirms success, the bootloader or upgrade manager rolls back to A.

The core value of A/B is that the old version remains available while the new version is being installed. On failure, at least one known-good system remains.

The cost is storage: more space is required, partition layout is more complex, and configuration/data partitions must be designed separately.
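A minimal sketch of the install step, assuming an in-memory model of the two slots. `Slot`, `Device`, `install`, and the `verify` callback are hypothetical names for illustration; on a real device the assignment to `image` would be a partition write and the boot flag would live in bootloader-readable storage.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    image: bytes = b""
    valid: bool = False


@dataclass
class Device:
    slots: dict      # {"A": Slot, "B": Slot}
    active: str      # slot currently running
    next_boot: str   # slot the bootloader will try next


def inactive(dev):
    return "B" if dev.active == "A" else "A"


def install(dev, image, verify):
    """Write `image` to the inactive slot; switch next_boot only if it verifies."""
    name = inactive(dev)
    target = dev.slots[name]
    target.valid = False            # the slot is untrusted while being written
    target.image = image            # stands in for writing the partition
    if not verify(target.image):    # check what was written, not what was downloaded
        return False                # next_boot untouched: the old slot still boots
    target.valid = True
    dev.next_boot = name            # the only change that affects the next boot
    return True
```

Note that the running slot is never written to; the only state that changes for the next boot is `next_boot`, and only after verification.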

Bootloader Flags Decide the Next Boot

OTA is not only a user-space action. What actually boots is usually decided by the bootloader.

The bootloader needs to know:

current active slot
whether the new slot may be tried
how many tries remain
whether the last boot was confirmed successful
whether rollback is required
whether image verification passed

Common state includes:

active_slot = B
boot_try_count = 3
boot_success = false
rollback_slot = A

When the new system boots for the first time, it usually should not mark itself successful immediately. It should first pass minimal health checks: kernel up, rootfs usable, critical services running, network available, and status reported.

Only after those conditions are met should the user-space upgrade service write the “confirmed successful” flag.

If the device reboots before confirmation, the bootloader reduces the try count. Once tries are exhausted, it rolls back to the old slot.
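The try-count logic can be simulated in a few lines. This is a sketch under assumptions: the field names mirror the state listed above, `select_slot` stands in for what a bootloader would do on each boot, and `confirm` stands in for the user-space upgrade service. Real bootloaders store and update this state in their own formats.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    active_slot: str
    rollback_slot: str
    boot_try_count: int
    boot_success: bool


def select_slot(st):
    """Simulated bootloader decision, run on every boot. Mutates `st`."""
    if st.boot_success:
        return st.active_slot            # confirmed version: just boot it
    if st.boot_try_count > 0:
        st.boot_try_count -= 1           # consume a try before handing over control
        return st.active_slot
    # Tries exhausted without confirmation: fall back to the old slot.
    st.active_slot = st.rollback_slot
    st.boot_success = True               # assumption: the old slot was confirmed earlier
    return st.active_slot


def confirm(st):
    """Called by user space only after health checks pass."""
    st.boot_success = True
```

With `BootState("B", "A", 3, False)` and no confirmation, four consecutive boots select B, B, B, and then A. The try count is decremented before the new system runs, so a crash at any later point still counts against the budget.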

Health Checks Must Be Deeper Than “Process Started”

Upgrade confirmation is often too shallow.

If success is marked as soon as the main process starts, many failures are missed:

network cannot connect
device certificate or key is unavailable
data partition fails to mount
critical peripherals fail to probe
configuration migration fails
application runs but cannot reach backend
watchdog resets shortly after boot
new version is incompatible with the hardware revision

Health checks should match the product’s minimum usable definition.

For a connected device, that may include:

system completed boot
critical services are running
data partition is readable and writable
required drivers and device nodes exist
network connection is established
backend receives the new version status
watchdog and crash logs show no repeated failures

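Confirming too early pins a bad version. Requiring too much before confirmation, or allowing too little time for it, may mistake a normal slow boot for failure and trigger an unnecessary rollback. The confirmation window must be designed together with boot time, network environment, and low-power behavior.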
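One way to express this is a set of named checks that must all pass, within a bounded number of attempts, before success is written. The sketch below is illustrative only: the check names, the `attempts` budget, and the `mark_success` and `wait` callbacks are assumptions, and real checks would probe mounts, services, and the backend.

```python
def run_health_checks(checks):
    """checks: mapping of name -> zero-argument callable returning True or False.
    Returns the names of the checks that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False               # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


def confirm_when_healthy(checks, mark_success, attempts=3, wait=lambda: None):
    """Retry the checks up to `attempts` times; confirm only when all pass.
    Returns [] on success, otherwise the checks that failed on the last attempt."""
    failed = list(checks)
    for _ in range(attempts):
        failed = run_health_checks(checks)
        if not failed:
            mark_success()           # the only place the success flag is written
            return []
        wait()                       # e.g. sleep while the network comes up
    return failed
```

The retry budget is where the "too early" versus "too late" trade-off lives: it should cover a normal slow boot but stay inside the bootloader's try-count and watchdog limits.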

Verify Before Crossing Trust Boundaries

An update package needs at least two kinds of verification.

The first is integrity verification: hash, length, or chunk checks. It answers whether the package downloaded completely and without corruption.

The second is authenticity verification: signature checking. It answers whether the package was produced by a trusted publisher.

If only a hash is checked, an attacker who can tamper with the download can replace both the package and the hash.
If verification happens only after download but not after writing or before boot, storage corruption and write errors may be missed.

Reliable systems often verify at multiple points:

verify package after download
verify target partition after writing
bootloader verifies image header or signature before boot
user space reports version and build information after boot

Secure-boot systems extend this into a trust chain covering bootloader, kernel, rootfs, and application package.
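A rough sketch of the first two verification points follows. It makes one important substitution: Python's standard library has no asymmetric signature API, so an HMAC over the manifest stands in for the signature check. A real device would normally verify an asymmetric signature so that it holds only a public key. The manifest layout and all function names are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def _tag(manifest, key):
    # Authenticate the fields that describe the package (stand-in for a signature).
    signed = f"{manifest['version']}:{manifest['length']}:{manifest['sha256']}".encode()
    return hmac.new(key, signed, hashlib.sha256).hexdigest()


def make_manifest(package, version, key):
    """Publisher side: describe the package and authenticate the description."""
    manifest = {"version": version, "length": len(package), "sha256": sha256_hex(package)}
    manifest["tag"] = _tag(manifest, key)
    return manifest


def verify_package(package, manifest, key):
    """After download: authenticity of the manifest first, then integrity of the package."""
    if not hmac.compare_digest(_tag(manifest, key), manifest["tag"]):
        return False                 # manifest not produced by a trusted publisher
    if len(package) != manifest["length"]:
        return False                 # truncated or padded download
    return hmac.compare_digest(sha256_hex(package), manifest["sha256"])


def verify_written(read_back, manifest):
    """After writing: hash the bytes read back from the target partition.
    The partition may be larger than the image, so only `length` bytes count."""
    image = read_back[:manifest["length"]]
    return hmac.compare_digest(sha256_hex(image), manifest["sha256"])
```

Because the hash is covered by the authenticated manifest, replacing both the package and its hash no longer works without the key.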

Power-Failure Windows Must Be Recoverable

Upgrade must assume power can fail at any point.

The key is that every interruption leaves a clear state.

For example:

download interrupted -> resume download or discard temp package
write B interrupted -> mark B invalid and continue booting A
write completed but not switched -> still boot A
switched but not confirmed -> try B, then roll back to A on failure
confirmed then power loss -> B is the stable version
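The table above maps directly to a lookup performed at power-up. In this sketch the stage names and the `on_power_up` function are hypothetical, and the fallback for an unreadable stage (stay on slot A) is an assumption that only makes sense while A is the known-good slot, as in the example.

```python
# Persisted stage at the moment of interruption -> (slot to boot, recovery action).
RECOVERY = {
    "downloading": ("A", "resume download or discard temp package"),
    "writing_b":   ("A", "mark B invalid"),
    "written":     ("A", "switch has not happened; it may be retried"),
    "switched":    ("B", "try B, roll back to A on failure"),
    "confirmed":   ("B", "nothing to recover"),
}


def on_power_up(stage):
    """Return (slot, action) for the persisted stage.
    An unknown or unreadable stage falls back to the known-good slot."""
    return RECOVERY.get(stage, ("A", "state unreadable: stay on known-good slot"))
```

Every stage has exactly one answer to "what boots next", which is the property the table is meant to guarantee.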

This requires upgrade state itself to be reliably persisted. State markers cannot become unreadable halfway through.

Common designs include duplicated metadata, version numbers, CRCs, atomic updates, redundant bootloader environments, append-only state records, or storing state in a dedicated reliable area.
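As one illustration of duplicated metadata with version numbers and CRCs, the sketch below keeps two copies of the state record and always overwrites the older or invalid one. The record layout (a CRC32 followed by a JSON payload) and the function names are assumptions made for the example, and a Python list stands in for two storage locations.

```python
import json
import struct
import zlib


def encode(seq, state):
    """Record layout (assumed): 4-byte little-endian CRC32, then a JSON payload."""
    payload = json.dumps({"seq": seq, "state": state}, sort_keys=True).encode()
    return struct.pack("<I", zlib.crc32(payload)) + payload


def decode(blob):
    """Return the record, or None if the copy is missing, torn, or corrupted."""
    if len(blob) < 4:
        return None
    (crc,) = struct.unpack("<I", blob[:4])
    payload = blob[4:]
    if zlib.crc32(payload) != crc:
        return None
    try:
        return json.loads(payload)
    except ValueError:
        return None


def load(copies):
    """Pick the valid copy with the highest sequence number, or None."""
    records = [r for r in map(decode, copies) if r is not None]
    return max(records, key=lambda r: r["seq"], default=None)


def store(copies, state):
    """Overwrite the invalid or oldest copy, never the newest valid one."""
    seqs = [r["seq"] if r else -1 for r in map(decode, copies)]
    victim = seqs.index(min(seqs))
    copies[victim] = encode(max(seqs) + 1, state)
```

If power fails while one copy is being written, that copy fails its CRC and `load` returns the previous record from the other copy, so the device still knows what to boot.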

Upgrade state matters more than ordinary logs. It decides what the device boots next.

Configuration Migration Is Easy to Miss

A/B slots preserve the old system, but configuration and data are often shared.

That creates migration problems.

The new version may change configuration format, database schema, cache directories, certificate paths, permission model, or device state files. If the migration is irreversible and the new version fails, rollback may return to an old version that cannot understand the new data.

Configuration migration also needs rollback design:

versioned configuration
backup before migration
backward compatibility for a while
idempotent schema migration
commit marker after migration
ability to restore old format or use compatibility paths on rollback
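A small sketch of a migration that follows these rules, using a dictionary in place of the shared data partition. The configuration fields, the v1 and v2 formats, and the function names are invented for illustration. The `version` field doubles as the commit marker because it is written together with the new format in a single assignment, which stands in for an atomic file replace.

```python
import copy


def migrate_v1_to_v2(cfg):
    """Hypothetical format change: flat server fields become a nested object."""
    new = dict(cfg)
    new["server"] = {"host": new.pop("server_host"), "port": new.pop("server_port")}
    new["version"] = 2
    return new


def migrate(store):
    """store: dict standing in for the shared data partition."""
    cfg = store["config"]
    if cfg.get("version", 1) >= 2:
        return                                  # already migrated: running again is a no-op
    store["backup"] = copy.deepcopy(cfg)        # backup before touching anything
    store["config"] = migrate_v1_to_v2(cfg)     # new format and version land together


def rollback_config(store):
    """On rollback, restore the format the old version understands."""
    if "backup" in store:
        store["config"] = store.pop("backup")
```

The old configuration is never modified in place, so a failure inside `migrate_v1_to_v2` leaves `store["config"]` in the v1 format.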

Many upgrade failures are not broken system images. They are shared data changed into a form the old version cannot read.

Application, rootfs, Kernel, and Bootloader Upgrades Differ

Some devices upgrade only applications. Others upgrade rootfs. Others upgrade bootloader, kernel, Device Tree, and applications.

The risk differs by layer.

If an application upgrade fails, the service manager may restart or roll back to the old service.
If rootfs upgrade fails, the system may never reach user space.
If kernel or Device Tree upgrade fails, drivers and rootfs may not start.
If bootloader upgrade fails, the device may lose the ability to load any recovery system.

The earlier a component runs in the boot chain, the more conservative its upgrade should be.

Bootloaders usually should not be updated frequently. When they must be updated, stronger verification, redundancy, recovery mode, or factory fallback is required.

Rollback Needs Policy

Rollback is not simply “go back on failure.”

The system needs to define:

which failures trigger rollback
whether to retry upgrade after rollback
whether to block the same bad package from reinstalling
whether user data remains compatible with the old version
how the backend learns the device rolled back
whether failure evidence is preserved
whether degraded mode is better than full rollback

If the device fails to confirm only because the network is poor, it may roll back unnecessarily.
If a bad version marks itself successful too early, automatic rollback is lost.
If the backend keeps pushing the same bad package after rollback, the device may repeat the failure loop.

Rollback is part of the upgrade state machine, not a patch in an error path.
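Part of such a policy can be written down as data plus a small decision function. In the sketch below, the reason strings, the choice to treat `network_unreachable` as "keep trying" rather than a rollback trigger, and the `RollbackPolicy` class are all assumptions; a real product defines its own classification.

```python
from dataclasses import dataclass, field

# Assumed classification: failures that should not trigger rollback on their own.
WAIT_REASONS = {"network_unreachable"}


@dataclass
class RollbackPolicy:
    blocked_versions: set = field(default_factory=set)
    report_queue: list = field(default_factory=list)   # evidence kept for the backend

    def on_failure(self, version, reason):
        """Decide what to do about a failed version and record why."""
        if reason in WAIT_REASONS:
            return "keep_trying"       # still bounded by the bootloader try count
        self.blocked_versions.add(version)
        self.report_queue.append(
            {"version": version, "reason": reason, "action": "rollback"}
        )
        return "rollback"

    def may_install(self, version):
        """Refuse a package that already caused a rollback on this device."""
        return version not in self.blocked_versions
```

Recording the reason alongside the blocked version lets the backend learn that the device rolled back and stops the same package from being reinstalled in a loop.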

How to Debug Upgrade Failures

When OTA fails, whether the device cannot boot after the upgrade, repeatedly rolls back, or becomes unavailable, split the path into stages and check each one.

First, check package trust and integrity. Did hash, signature, version, and hardware compatibility pass?

Second, check the write target. Was the inactive slot written? Are partition size, offset, and bad-block handling correct?

Third, check bootloader state. Do active slot, try count, success flag, and rollback slot match expectations?

Fourth, identify the boot stage that failed. Bootloader, kernel, rootfs, init, service, and network reporting each leave different logs.

Fifth, check health checks. Are confirmation conditions too shallow or too strict?

Sixth, check configuration migration. Do old and new versions share data, and is schema compatibility handled?

Seventh, check power-failure recovery. After interruption at any step, can the device still decide what to boot next?

These questions split “OTA is broken” into download, write, boot, confirm, and rollback stages.

What Matters in Practice

Device upgrade is not file copying. It is a system state transition.

Reliable OTA must consider:

package integrity and authenticity
inactive-slot writing
bootloader boot selection
first-boot health checks
success confirmation and failure rollback
configuration and data migration
recovery after power loss
backend reporting and bad-package suppression

Upgrade success is only one outcome. The engineering goal is that if any step fails, the device does not lose its last working version and does not enter a state that cannot be recovered remotely.