The worst remote-device upgrade failure is not “the upgrade failed.” It is “the upgrade failed and the device never comes back.”
Many IoT devices are deployed in the field. Engineers cannot easily open them, flash them manually, or attach a serial console. If one OTA update corrupts the boot slot, migrates configuration irreversibly, or switches to a new system that cannot connect back to confirm success, the device may become unrecoverable remotely.
So device upgrade is not just downloading a package, overwriting files, and rebooting.
A useful first model: a reliable upgrade is a state machine spanning the application, filesystem, bootloader, partition layout, configuration migration, and health checks. Every step must leave a recoverable state after power loss, reboot, write failure, or a broken new version.
download update package
-> verify integrity and signature
-> write inactive slot or temporary area
-> mark next boot to new version
-> reboot into new version
-> new version passes health check
-> confirm upgrade success
-> roll back to old version on failure
The most important goal is not “can upgrade succeed,” but “can any failed step preserve a path back.”
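The flow above can be sketched as a small transition table. This is a minimal illustration, not a definitive design: the step names and the `advance` and `run` helpers are hypothetical, and each boolean stands for the result of the action attempted from the current step.

```python
from enum import Enum, auto
from typing import Iterable


class Step(Enum):
    IDLE = auto()
    DOWNLOADED = auto()
    VERIFIED = auto()
    WRITTEN = auto()
    SWITCHED = auto()      # next boot is marked to try the new version
    BOOTED_NEW = auto()
    CONFIRMED = auto()
    ROLLED_BACK = auto()


# Where each step goes when its next action succeeds.
NEXT_ON_SUCCESS = {
    Step.IDLE: Step.DOWNLOADED,
    Step.DOWNLOADED: Step.VERIFIED,
    Step.VERIFIED: Step.WRITTEN,
    Step.WRITTEN: Step.SWITCHED,
    Step.SWITCHED: Step.BOOTED_NEW,
    Step.BOOTED_NEW: Step.CONFIRMED,
}

# Where each step goes when its next action fails. Before the switch the old
# version is still the boot target, so the updater simply returns to IDLE.
# After the switch, a failure means rolling back to the old version.
NEXT_ON_FAILURE = {
    Step.IDLE: Step.IDLE,
    Step.DOWNLOADED: Step.IDLE,
    Step.VERIFIED: Step.IDLE,
    Step.WRITTEN: Step.IDLE,
    Step.SWITCHED: Step.ROLLED_BACK,
    Step.BOOTED_NEW: Step.ROLLED_BACK,
}


def advance(step: Step, ok: bool) -> Step:
    """Return the next step; terminal steps stay where they are."""
    table = NEXT_ON_SUCCESS if ok else NEXT_ON_FAILURE
    return table.get(step, step)


def run(results: Iterable[bool]) -> Step:
    """Feed a sequence of action results through the machine from IDLE."""
    step = Step.IDLE
    for ok in results:
        step = advance(step, ok)
    return step
```

The point of the failure table is that it has no entry leading to an unbootable state: every failed action ends at either the untouched old version or an explicit rollback.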
Why Overwriting the Current System Is Dangerous
The most obvious upgrade method is overwriting the running system with the new version.
This is risky.
During upgrade, many things can happen:
- power loss
- network interruption
- storage write failure
- corrupted package
- filesystem out of space
- processes still using old files
- dependencies half-updated
- reboot in the middle of the process
If the current system is overwritten halfway, the next boot may be neither old nor new. The bootloader may load the kernel, but rootfs is incomplete. The application may start, but libraries do not match. Configuration may be migrated, but the new program does not run.
Reliable OTA usually avoids destroying the currently working system. It writes the new version to another slot, temporary area, or verifiable image first, switches only after the written image has been verified, and keeps the old version until the new one has booted and been confirmed.
A/B Slots Preserve the Old Version
A/B partitioning is common in device upgrades.
The system has two slots:
slot A: currently running version
slot B: upgrade target
When the device boots from A, the updater writes the new version to B. After writing and verification, it only changes the boot flag so the bootloader tries B next time.
If B boots successfully and passes health checks, the system marks B as confirmed.
If B fails to boot, keeps rebooting, or never confirms success, the bootloader or upgrade manager rolls back to A.
The core value of A/B is that the old version remains available while the new version is being installed. On failure, at least one known-good system remains.
The cost is storage: more space is required, partition layout is more complex, and configuration/data partitions must be designed separately.
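As a rough sketch of that ordering, the function below writes only to the inactive slot and changes the boot flags as the last step. Everything here is an assumption for illustration: slots are modeled as in-memory byte strings, the boot flags as a dictionary, and `install_update` and the try count of 3 are hypothetical.

```python
import hashlib


def inactive_slot(active: str) -> str:
    """Return the slot that is not currently running."""
    return "B" if active == "A" else "A"


def install_update(slots: dict, boot: dict, image: bytes, expected_sha256: str) -> bool:
    """Write the image to the inactive slot; flip the boot flag only if it verifies."""
    running = boot["active_slot"]
    target = inactive_slot(running)

    slots[target] = bytes(image)  # the running slot is never written

    # Read back what landed in the slot instead of trusting the write call.
    if hashlib.sha256(slots[target]).hexdigest() != expected_sha256:
        return False  # boot flags untouched: the next boot is still `running`

    # Last step: point the bootloader at the new slot, still unconfirmed.
    boot.update(active_slot=target, rollback_slot=running,
                boot_try_count=3, boot_success=False)
    return True
```

If the read-back check fails, the function returns before touching the flags, so the device keeps booting the slot it is already running from.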
Bootloader Flags Decide the Next Boot
OTA is not only a user-space action. What actually boots is usually decided by the bootloader.
The bootloader needs to know:
- current active slot
- whether the new slot may be tried
- how many tries remain
- whether the last boot was confirmed successful
- whether rollback is required
- whether image verification passed
Common state includes:
active_slot = B
boot_try_count = 3
boot_success = false
rollback_slot = A
When the new system boots for the first time, it usually should not mark itself successful immediately. It should first pass minimal health checks: kernel up, rootfs usable, critical services running, network available, and status reported.
Only after those conditions are met should the user-space upgrade service write the “confirmed successful” flag.
If the device reboots before confirmation, the bootloader decrements the try count. Once the tries are exhausted, it rolls back to the old slot.
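The try-count logic can be sketched as follows. The field names mirror the example state above; `select_slot` and `confirm_boot` are hypothetical names, and a real bootloader would implement this in its own environment and persist the state before jumping to the image.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    active_slot: str      # slot the bootloader will try next
    rollback_slot: str    # last known-good slot
    boot_try_count: int   # remaining attempts for an unconfirmed slot
    boot_success: bool    # set by user space after health checks pass


def select_slot(state: BootState) -> str:
    """Run at every boot: decide which slot to load and update the state."""
    if state.boot_success:
        return state.active_slot          # confirmed slot, nothing to track
    if state.boot_try_count > 0:
        state.boot_try_count -= 1         # spend one attempt before booting
        return state.active_slot
    # Tries exhausted without confirmation: fall back to the old slot.
    state.active_slot = state.rollback_slot
    state.boot_success = True
    return state.active_slot


def confirm_boot(state: BootState) -> None:
    """Called by the user-space upgrade service once health checks pass."""
    state.boot_success = True
    state.rollback_slot = state.active_slot
```

Decrementing the counter before the boot attempt, rather than after, matters: a new version that hangs or resets early never gets the chance to update the counter itself.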
Health Checks Must Be Deeper Than Process Started
Upgrade confirmation is often too shallow.
If the main process starting is enough to mark success, many failures are missed:
- network cannot connect
- device certificate or key is unavailable
- data partition fails to mount
- critical peripherals fail to probe
- configuration migration fails
- application runs but cannot reach backend
- watchdog resets shortly after boot
- new version is incompatible with the hardware revision
Health checks should match the product’s minimum usable definition.
For a connected device, that may include:
- system completed boot
- critical services are running
- data partition is readable and writable
- required drivers and device nodes exist
- network connection is established
- backend receives the new version status
- watchdog and crash logs show no repeated failures
Confirming too early pins a bad version. Demanding too much before confirming, or allowing too little time, may mistake a normal slow boot for failure. The confirmation rule must be designed together with boot time, network environment, and low-power behavior.
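One way to express this is a set of named checks polled until they all pass or a deadline expires. This is a minimal sketch under assumptions: `run_health_checks`, the check names, and the deadline values are hypothetical, and real checks would probe mounts, services, and the backend rather than return constants.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_health_checks(
    checks: Dict[str, Callable[[], bool]],
    deadline_s: float,
    poll_s: float = 1.0,
) -> Tuple[bool, List[str]]:
    """Poll all checks until every one passes or the deadline expires.

    Returns (healthy, names of checks still failing).
    """
    start = time.monotonic()
    while True:
        failing = []
        for name, check in checks.items():
            try:
                ok = check()
            except Exception:
                ok = False  # a crashing check counts as a failed check
            if not ok:
                failing.append(name)
        if not failing:
            return True, []
        if time.monotonic() - start >= deadline_s:
            return False, failing
        time.sleep(poll_s)
```

The list of failing names is worth keeping: it is the evidence the backend needs to tell a slow network apart from a data partition that never mounted.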
Verify Before Crossing Trust Boundaries
An update package needs at least two kinds of verification.
The first is integrity verification: hash, length, or chunk checks. It answers whether the package downloaded completely and without corruption.
The second is authenticity verification: signature checking. It answers whether the package was produced by a trusted publisher.
If only a hash is checked, an attacker who can tamper with the download can replace both the package and its hash.
If verification happens only after download, and not again after writing or before boot, storage corruption and write errors may be missed.
Reliable systems often verify at multiple points:
- verify package after download
- verify target partition after writing
- bootloader verifies image header or signature before boot
- user space reports version and build information after boot
Secure-boot systems extend this into a trust chain covering bootloader, kernel, rootfs, and application package.
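The difference between the two checks can be sketched with a signed manifest that carries the package length and hash. This is an illustration under loud assumptions: the manifest layout and `verify_package` are hypothetical, and HMAC with a shared key stands in for a real signature only because Python's standard library has no asymmetric signature primitive. A real device would verify with a public key so that no signing secret is stored on the device.

```python
import hashlib
import hmac
import json


def verify_package(payload: bytes, manifest_bytes: bytes,
                   manifest_tag: bytes, key: bytes) -> bool:
    """Check authenticity of the manifest, then integrity of the payload."""
    # 1. Authenticity: check the manifest tag before trusting anything in it.
    #    (HMAC is a stand-in for a public-key signature check.)
    expected_tag = hmac.new(key, manifest_bytes, hashlib.sha256).digest()
    if not hmac.compare_digest(expected_tag, manifest_tag):
        return False

    manifest = json.loads(manifest_bytes)

    # 2. Integrity: length and hash come from the now-trusted manifest.
    if len(payload) != manifest["length"]:
        return False
    return hashlib.sha256(payload).hexdigest() == manifest["sha256"]
```

An attacker who swaps both the package and the hash inside the manifest still fails step 1, because the tag no longer matches the altered manifest.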
Power-Failure Windows Must Be Recoverable
Upgrade must assume power can fail at any point.
The key is that every interruption leaves a clear state.
For example:
download interrupted -> resume download or discard temp package
write B interrupted -> mark B invalid and continue booting A
write completed but not switched -> still boot A
switched but not confirmed -> try B, then roll back to A on failure
confirmed then power loss -> B is the stable version
This requires upgrade state itself to be reliably persisted. State markers cannot become unreadable halfway through.
Common designs include duplicated metadata, version numbers, CRCs, atomic updates, redundant bootloader environments, append-only state records, or storing state in a dedicated reliable area.
Upgrade state matters more than ordinary logs. It decides what the device boots next.
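A minimal sketch of duplicated metadata with CRCs and sequence numbers follows. The record layout and the function names are assumptions for illustration, and the two copies are modeled as entries in a Python list rather than as flash sectors.

```python
import json
import zlib
from typing import List, Optional


def encode_record(seq: int, state: dict) -> bytes:
    """Serialize a state record as a 4-byte CRC32 followed by a JSON body."""
    body = json.dumps({"seq": seq, "state": state}, sort_keys=True).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body


def decode_record(raw: bytes) -> Optional[dict]:
    """Return the record, or None if the copy is empty, torn, or corrupt."""
    if len(raw) < 4:
        return None
    crc, body = int.from_bytes(raw[:4], "big"), raw[4:]
    if zlib.crc32(body) != crc:
        return None
    try:
        return json.loads(body)
    except ValueError:
        return None


def load_state(copies: List[bytes]) -> Optional[dict]:
    """Pick the valid copy with the highest sequence number."""
    valid = [r for r in map(decode_record, copies) if r is not None]
    return max(valid, key=lambda r: r["seq"]) if valid else None


def save_state(copies: List[bytes], state: dict) -> None:
    """Overwrite the oldest or invalid copy, never the newest valid one."""
    seqs = [(r["seq"] if r else -1) for r in map(decode_record, copies)]
    victim = min(range(len(copies)), key=lambda i: seqs[i])
    copies[victim] = encode_record(max(max(seqs), 0) + 1, state)
```

Because a save only ever overwrites the older copy, a power cut in the middle of a write damages at most that copy, and `load_state` falls back to the previous valid record.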
Configuration Migration Is Easy to Miss
A/B slots preserve the old system, but configuration and data are often shared.
That creates migration problems.
The new version may change configuration format, database schema, cache directories, certificate paths, permission model, or device state files. If the migration is irreversible and the new version fails, rollback may return to an old version that cannot understand the new data.
Configuration migration also needs rollback design:
- versioned configuration
- backup before migration
- backward compatibility for a while
- idempotent schema migration
- commit marker after migration
- ability to restore old format or use compatibility paths on rollback
Many upgrade failures are not broken system images. They are shared data changed into a form the old version cannot read.
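Several of these ideas fit in a short sketch: a version field, a backup taken before migration, an idempotent migration, and a restore path for rollback. The schema change shown (splitting a `server` string into `host` and `port`) and all function names are hypothetical, and the store is an in-memory dictionary standing in for the shared data partition.

```python
import copy

TARGET_VERSION = 2


def migrate_v1_to_v2(cfg: dict) -> dict:
    """Hypothetical schema change: split "server" into "host" and "port"."""
    new = copy.deepcopy(cfg)
    host, _, port = new.pop("server").partition(":")
    new["host"] = host
    new["port"] = int(port or 443)
    new["version"] = 2
    return new


MIGRATIONS = {1: migrate_v1_to_v2}


def migrate(store: dict) -> None:
    """Bring the stored config up to TARGET_VERSION, keeping a backup."""
    cfg = store["config"]
    if cfg["version"] >= TARGET_VERSION:
        return  # already migrated: running again changes nothing
    store["backup"] = copy.deepcopy(cfg)  # keep the old format for rollback
    while cfg["version"] < TARGET_VERSION:
        cfg = MIGRATIONS[cfg["version"]](cfg)
    store["config"] = cfg  # single assignment acts as the commit point


def restore_on_rollback(store: dict) -> None:
    """Put the pre-migration config back so the old version can read it."""
    if "backup" in store:
        store["config"] = store.pop("backup")
```

On real storage the commit point would need to be an atomic operation, such as writing a new file and renaming it over the old one, rather than a dictionary assignment.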
Application, rootfs, Kernel, and Bootloader Upgrades Differ
Some devices upgrade only applications. Others upgrade rootfs. Others upgrade bootloader, kernel, Device Tree, and applications.
The risk differs by layer.
If an application upgrade fails, the service manager may restart the service or roll back to the old version.
If rootfs upgrade fails, the system may never reach user space.
If kernel or Device Tree upgrade fails, drivers may not load and the rootfs may not mount.
If bootloader upgrade fails, the device may lose the ability to load any recovery system.
The closer a component is to the front of the boot chain, the more conservative its upgrade should be.
Bootloaders usually should not be updated frequently. When they must be updated, stronger verification, redundancy, recovery mode, or factory fallback is required.
Rollback Needs Policy
Rollback is not simply “go back on failure.”
The system needs to define:
- which failures trigger rollback
- whether to retry upgrade after rollback
- whether to block the same bad package from reinstalling
- whether user data remains compatible with the old version
- how the backend learns the device rolled back
- whether failure evidence is preserved
- whether degraded mode is better than full rollback
If the device fails to confirm only because the network is poor, it may roll back unnecessarily.
If a bad version marks itself successful too early, automatic rollback is lost.
If the backend keeps pushing the same bad package after rollback, the device may repeat the failure loop.
Rollback is part of the upgrade state machine, not a patch in an error path.
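A policy like this can be written down as an explicit decision function plus a record of bad versions. The sketch below is one possible policy, not a recommendation: the `decide` function, the 600-second grace period, and the `Blocklist` class are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    CONFIRM = "confirm"
    WAIT = "wait"
    ROLLBACK = "rollback"


def decide(local_ok: bool, backend_reachable: bool,
           seconds_since_boot: float, grace_s: float = 600.0) -> Action:
    """Decide what to do with an unconfirmed new version."""
    if not local_ok:
        return Action.ROLLBACK          # local health checks failed
    if backend_reachable:
        return Action.CONFIRM           # healthy and reported to the backend
    if seconds_since_boot < grace_s:
        return Action.WAIT              # poor network alone is not yet a failure
    return Action.ROLLBACK              # never reached the backend in time


@dataclass
class Blocklist:
    """Remember versions that rolled back so they are not reinstalled."""
    bad_versions: set = field(default_factory=set)

    def record_rollback(self, version: str) -> None:
        self.bad_versions.add(version)

    def should_install(self, version: str) -> bool:
        return version not in self.bad_versions
```

The grace period is what keeps a poor network from triggering an unnecessary rollback, and the blocklist is what breaks the loop when the backend keeps offering the same bad package. The blocklist would need to live in persistent storage that survives the rollback itself.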
How to Debug Upgrade Failures
When OTA fails, whether the device cannot boot after the upgrade, repeatedly rolls back, or becomes unavailable, split the path into stages and check them in order.
First, check package trust and integrity. Did hash, signature, version, and hardware compatibility pass?
Second, check the write target. Was the inactive slot written? Are partition size, offset, and bad-block handling correct?
Third, check bootloader state. Do active slot, try count, success flag, and rollback slot match expectations?
Fourth, identify the boot stage that failed. Bootloader, kernel, rootfs, init, service, and network reporting each leave different logs.
Fifth, check health checks. Are confirmation conditions too shallow or too strict?
Sixth, check configuration migration. Do old and new versions share data, and is schema compatibility handled?
Seventh, check power-failure recovery. After interruption at any step, can the device still decide what to boot next?
These questions split “OTA is broken” into download, write, boot, confirm, and rollback stages.
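The ordered checklist can be turned into a small triage helper that reports the first stage without passing evidence. This is only a sketch: the stage names and `first_failing_stage` are hypothetical, and filling in the results still requires reading the actual logs and bootloader state.

```python
from typing import Dict, Optional

# Stages in the order the questions above are asked.
STAGES = [
    "package",         # hash, signature, version, hardware compatibility
    "write",           # inactive slot written, sizes and offsets correct
    "boot_state",      # active slot, try count, success flag, rollback slot
    "boot_stage",      # bootloader, kernel, rootfs, init, service, network
    "health_check",    # confirmation conditions neither shallow nor too strict
    "migration",       # shared data and schema compatibility
    "power_recovery",  # interruption at any step leaves a bootable decision
]


def first_failing_stage(results: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that failed or has no evidence yet."""
    for stage in STAGES:
        if not results.get(stage, False):  # missing evidence counts as failing
            return stage
    return None
```

Treating missing evidence as a failure keeps the investigation from skipping ahead to a later stage before the earlier ones are known to be sound.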
What Matters in Practice
Device upgrade is not file copying. It is a system state transition.
Reliable OTA must consider:
- package integrity and authenticity
- inactive-slot writing
- bootloader boot selection
- first-boot health checks
- success confirmation and failure rollback
- configuration and data migration
- recovery after power loss
- backend reporting and bad-package suppression
Upgrade success is only one outcome. The engineering goal is that if any step fails, the device does not lose its last working version and does not enter a state that cannot be recovered remotely.