The hardest field failures are often not “the device crashed,” but “the device crashed, rebooted, and left nothing useful behind.”
After reboot, everything may look normal. Services restart, the network reconnects, and logs begin from the new boot. Users only know the device was offline. Engineers are left guessing: was it an application crash, a kernel panic, a watchdog reset, a power loss, a voltage dip, or an external MCU resetting the main processor?
The goal of crash evidence is not to save every log line. Useful evidence should be small enough, reliable enough, and specific enough to let the next boot identify the failure type, failure location, system state, and recovery path.
A useful first model is:
record key state while running
-> failure happens
-> preserve minimal evidence if possible
-> system resets or service restarts
-> next boot reads evidence
-> classify, report, recover, or roll back
Crash evidence is not one log file. It is an evidence chain from runtime, through failure, into the next boot.
First, Separate Failure Types
“Crash” is too broad. Different failures need different evidence.
Common categories include:
- application process crash, such as segmentation fault, abort, or uncaught exception
- kernel oops or panic
- watchdog reset
- power loss or brown-out
- CPU runaway or hard fault
- process killed by OOM
- service manager restarting a service intentionally
- external MCU or power-management chip asserting reset
Without this classification, logs can be large but still fail to indicate where to look.
Application crashes need call stacks, signals, versions, and input context. Kernel crashes need panic/oops text, registers, kernel stack, and driver state. Watchdog resets need last heartbeats, task state, and last feed time. Power failures need power-event records, filesystem recovery logs, and unfinished write state.
The first item in minimal crash evidence should be the failure class.
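One way to make the failure class explicit is to give it a fixed set of values and attach an evidence checklist to each. The sketch below is only an illustration: the enum members mirror the categories listed above, and the field names (such as last_feed_time or fs_recovery_log) are hypothetical labels chosen for this example, not a standard schema.

```python
from enum import Enum


class FailureClass(Enum):
    APP_CRASH = "app_crash"
    KERNEL_PANIC = "kernel_panic"
    WATCHDOG_RESET = "watchdog_reset"
    POWER_LOSS = "power_loss"
    HARD_FAULT = "hard_fault"
    OOM_KILL = "oom_kill"
    SERVICE_RESTART = "service_restart"
    EXTERNAL_RESET = "external_reset"
    UNKNOWN = "unknown"


# Hypothetical field names: the evidence each class is expected to carry.
REQUIRED_EVIDENCE = {
    FailureClass.APP_CRASH: ("signal", "call_stack", "build_id", "input_context"),
    FailureClass.KERNEL_PANIC: ("panic_text", "registers", "kernel_stack", "driver_state"),
    FailureClass.WATCHDOG_RESET: ("last_heartbeats", "task_state", "last_feed_time"),
    FailureClass.POWER_LOSS: ("power_events", "fs_recovery_log", "unfinished_writes"),
}


def missing_evidence(failure_class, record):
    """Return the expected fields that the evidence record does not contain."""
    wanted = REQUIRED_EVIDENCE.get(failure_class, ())
    return [field for field in wanted if field not in record]
```

A report pipeline could call missing_evidence() on every uploaded record to see which devices classify a failure but do not yet capture what that class needs.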
Reset Reason Is the First Clue After Reboot
Embedded devices often provide reset reason or reset cause through the SoC, PMU, RTC, external power chip, or bootloader.
Common causes include:
- power-on reset
- software reset
- watchdog reset
- brown-out or low-voltage reset
- external reset pin
- reboot after panic
- low-power wakeup
Reading and saving the reset reason early in boot matters. Many reset-cause registers are cleared or overwritten by later initialization. Read too late, and the clue is gone.
Reset reason is not the final answer, but it narrows the search. A watchdog reset and a power-on reset should not lead to the same investigation. Brown-out and software reboot are also very different.
If hardware reset reason is unreliable, software evidence should supplement it: whether the last shutdown was clean, whether a panic marker exists, and whether watchdog evidence was left unreported.
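As a rough sketch of combining hardware and software clues, the example below decodes a raw reset-cause value and then refines it with a panic marker and a clean-shutdown flag. The bit layout in RESET_BITS is invented for illustration (every SoC or PMU defines its own), and the returned labels are hypothetical names, not values from any particular platform.

```python
# Hypothetical bit layout. Real SoCs and PMUs define their own register format.
RESET_BITS = (
    (0x01, "power_on"),
    (0x02, "brown_out"),
    (0x04, "watchdog"),
    (0x08, "external_pin"),
    (0x10, "software"),
)


def decode_reset_cause(raw):
    """Return the names of all cause bits set in the raw register value."""
    return [name for mask, name in RESET_BITS if raw & mask]


def classify_boot(raw, clean_shutdown, panic_marker):
    """Combine the hardware cause with software markers from the previous run."""
    causes = decode_reset_cause(raw)
    if panic_marker:
        return "reboot_after_panic"
    if "brown_out" in causes:
        return "brown_out"
    if "watchdog" in causes:
        return "watchdog_reset"
    if "software" in causes:
        return "clean_reboot" if clean_shutdown else "unclean_software_reset"
    if "external_pin" in causes:
        return "external_reset"
    if "power_on" in causes:
        return "power_on" if clean_shutdown else "unexpected_power_loss"
    return "unknown"
```

The ordering of the checks is a design choice in this sketch: software markers and the more specific hardware causes win over a bare power-on bit, which many chips set alongside other causes.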
Application Crashes Need Stack and Version
For application process crashes, the most useful evidence is usually:
- process name and pid
- signal or exception type
- fault address
- call stack
- thread list
- registers or minimal CPU context
- program version, build ID, and symbol information
- recent business events
- input parameters or request ID
On Linux, these can come from core dumps, minidumps, crash handlers, or service managers. Space-constrained devices may not be able to store a full core, but should at least preserve enough information to symbolize the stack.
A stack trace without the matching version and symbols loses much of its value. Production devices must be able to map crash addresses back to the exact binary build. Otherwise an address is just a number.
For C and C++ programs, segmentation faults, use-after-free, stack overflow, null pointers, and out-of-bounds writes may all end as process crashes. Without stack and version, the scene is hard to reconstruct.
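To illustrate why version information has to travel with the stack, the sketch below turns absolute crash addresses into module-relative records. It assumes the crash handler captured text in the format of Linux /proc/<pid>/maps and that a path-to-build-ID table is available on the device; the function names and the build_ids table are hypothetical helpers for this example.

```python
def parse_maps(maps_text):
    """Parse /proc/<pid>/maps-style text into executable, named mappings."""
    mappings = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # Skip anonymous mappings (no pathname) and non-executable ones.
        if len(fields) < 6 or "x" not in fields[1]:
            continue
        start, end = (int(value, 16) for value in fields[0].split("-"))
        mappings.append((start, end, int(fields[2], 16), fields[5]))
    return mappings


def locate(addr, mappings):
    """Return (module path, file offset) for an address, or (None, addr)."""
    for start, end, file_offset, path in mappings:
        if start <= addr < end:
            return path, addr - start + file_offset
    return None, addr


def describe_frames(addresses, mappings, build_ids):
    """Build compact frame records that a server can symbolize later."""
    frames = []
    for addr in addresses:
        path, offset = locate(addr, mappings)
        frames.append(
            {"module": path, "offset": hex(offset), "build_id": build_ids.get(path)}
        )
    return frames
```

Records like these stay valid under address-space randomization because they no longer depend on where the module was loaded. The server still has to translate each file offset into a symbol using the ELF headers and debug information of the exact build.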
Kernel Crashes Need panic/oops Evidence
A kernel oops or panic is more dangerous than an application crash because it can compromise the whole system.
Useful evidence includes:
- panic/oops text
- current CPU and process context
- registers
- kernel call stack
- taint flags
- loaded modules
- recent kernel logs
- device driver state
- whether execution was in interrupt, softirq, or workqueue context
The problem is that after a kernel crash, the filesystem may not be safe to write to, so ordinary logs may not survive. Linux systems often use pstore, ramoops, kdump, netconsole, or serial logs to store crash information in reserved memory, special storage, or an external machine.
On embedded systems, ramoops is common: reserve a small memory region and read the previous panic or console tail after reboot.
The point is not to save every log. It is to preserve the last high-value kernel evidence before the crash.
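On the next boot, a small collector can pick up whatever the previous kernel left behind. The sketch below assumes pstore is mounted at its usual location, /sys/fs/pstore, and simply reads the tail of every record file it finds. The function name and the tail_bytes limit are choices made for this example, and the actual record file names depend on the backend and kernel configuration.

```python
import os


def collect_pstore(pstore_dir="/sys/fs/pstore", tail_bytes=4096):
    """Return {record name: tail of record text} for the previous boot's records."""
    records = {}
    if not os.path.isdir(pstore_dir):
        return records  # pstore not mounted or not configured
    for name in sorted(os.listdir(pstore_dir)):
        path = os.path.join(pstore_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            data = handle.read()
        # Keep only the tail: the last lines before the crash carry the most value.
        records[name] = data[-tail_bytes:].decode("utf-8", errors="replace")
    return records
```

A boot service would typically copy these records into its own evidence store before deleting the pstore files, since deleting a record is what frees its slot in the backing memory.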
Watchdog Resets Need Last Progress
A watchdog reset often has no ordinary crash stack. The system may simply stop making progress and then be reset by hardware.
The most useful evidence is last progress:
- last successful watchdog feed time
- last heartbeat for each task or thread
- key queue lengths
- current state-machine state
- last successful I/O
- current lock or resource wait
- whether the system was in upgrade, flash erase, low-power wake, or another long path
- watchdog trigger count and intervals
These fields cannot wait until the timeout moment. A hardware reset may leave no execution opportunity.
A more reliable design updates a retained area or persistent ring record at low cost during normal operation. Each update writes only a few critical fields, and the next boot reads them.
For watchdog evidence, the key question is where progress stopped, not whether a reset happened.
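A progress record of this kind can be very small. The sketch below packs a few of the fields listed above into a fixed 32-byte blob with a CRC, so the next boot can tell a valid record from a torn or stale one. The field selection, the magic value, and the layout are assumptions made for illustration; on a real device the blob would be written to retained RAM or a persistent ring, which is not shown here.

```python
import struct
import zlib

# magic, last feed time (ms), last I/O time (ms), state id, queue depth, heartbeat bitmap
_FMT = "<IQQHHI"
MAGIC = 0x57444F47  # ASCII "WDOG", an arbitrary marker chosen for this sketch


def encode_progress(feed_ms, io_ms, state, queue_depth, heartbeat_bits):
    """Pack the progress fields and append a CRC32 of the packed body."""
    body = struct.pack(_FMT, MAGIC, feed_ms, io_ms, state, queue_depth, heartbeat_bits)
    return body + struct.pack("<I", zlib.crc32(body))


def decode_progress(blob):
    """Return the record as a dict, or None if the blob is torn or corrupt."""
    size = struct.calcsize(_FMT)
    if len(blob) != size + 4:
        return None
    body = blob[:size]
    (crc,) = struct.unpack("<I", blob[size:])
    if zlib.crc32(body) != crc:
        return None
    magic, feed_ms, io_ms, state, queue_depth, bits = struct.unpack(_FMT, body)
    if magic != MAGIC:
        return None
    return {
        "feed_ms": feed_ms,
        "io_ms": io_ms,
        "state": state,
        "queue_depth": queue_depth,
        "heartbeat_bits": bits,
    }
```

Because a reset can land in the middle of an update, the CRC check matters more than the exact fields: a record that fails validation should be reported as "no progress evidence" rather than trusted.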
Logs Need Time, Order, and Context
Many logs are large but unusable because they lack ordering and context.
Useful logs should include:
- monotonic time or uptime timestamp
- log level
- module name
- thread, task, or pid
- key object ID, such as connection, request, device, or transaction
- error code
- state transition
- version information
Wall-clock time may be wrong before NTP sync, so device logs should preserve monotonic time since boot. Once real time is known, it can be correlated with monotonic time.
Logs should also trace a business path from start to finish. Writing only “failed” is not enough. The log should identify which request, connection, state machine, or device instance failed.
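The sketch below shows one possible shape for such a record and for the later correlation step: events carry a monotonic timestamp plus context, and wall-clock time is derived afterwards from a single (monotonic, wall) pair taken once real time is known. The function and field names are hypothetical. Note that Python's time.monotonic() on Linux typically does not advance during suspend, so a device that sleeps may prefer a boot-time clock instead.

```python
import time


def make_event(module, level, message, clock=time.monotonic, **context):
    """Create a log event stamped with monotonic time and caller-supplied context."""
    return {
        "mono": clock(),
        "module": module,
        "level": level,
        "msg": message,
        "ctx": context,
    }


def to_wall_clock(events, mono_at_sync, wall_at_sync):
    """Add an estimated wall-clock time to events using one sync point."""
    offset = wall_at_sync - mono_at_sync
    return [dict(event, wall=event["mono"] + offset) for event in events]
```

For example, make_event("net", "error", "connect failed", request_id="r-17", error_code=-110) ties the error to a specific request, and events logged before the time sync still receive a usable wall-clock estimate.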
Ring Logs Fit Field Evidence Better Than Infinite Append
Device storage is limited, and flash has finite write endurance. Unbounded append-only logs are impractical and can themselves cause failures.
Field evidence often fits a ring-based design better:
- an in-memory ring buffer for recent high-frequency events
- a persistent ring for key state changes
- a tail that is frozen on crash
- unreported evidence that is uploaded first after boot
The value of ring logs is preserving what happened just before failure. For many bugs, the last few hundred key events are more useful than complete logs from days ago.
But write cost matters. High-frequency flash writes cause wear and performance problems. Keep ordinary debug logs in memory, and persist low-frequency high-value facts such as state transitions, error codes, reset reasons, and version.
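A minimal in-memory version of this idea is a bounded ring whose tail can be frozen when a failure is detected. The class below is a sketch with hypothetical names; it keeps only the most recent events and stops accepting new ones after freeze(), so the recovery path cannot push the interesting tail out of the buffer.

```python
from collections import deque


class EventRing:
    """Bounded in-memory ring that keeps only the most recent events."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self._frozen = False

    def append(self, event):
        # Once frozen, later events are dropped so the crash tail is preserved.
        if not self._frozen:
            self._events.append(event)

    def freeze(self):
        """Stop recording and return the preserved tail, oldest first."""
        self._frozen = True
        return list(self._events)
```

A crash handler or fault path would call freeze() once and hand the returned list to whatever persists or uploads evidence.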
Not Every Log Should Be Synchronous
To avoid losing logs during crashes, some systems try to synchronously write every log line to storage. This is usually expensive.
Synchronous logging can cause:
- higher I/O latency
- flash write amplification
- storage wear
- more complicated power-failure windows
- logs slowing or destabilizing the business path
A better design is layered:
- high-frequency debug logs in an in-memory ring buffer
- key errors and state changes persisted at low frequency
- crash evidence in retained memory, a special partition, or compact records
- business data and logs using different persistence policies
The logging system should not become a new source of instability.
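The layering can be sketched as a small router: every event goes to the in-memory ring, and only selected levels are persisted, with a minimum interval between persistent writes. Everything here is an assumption for illustration, including the level names, the rate-limit policy, and the persisted list that stands in for a flash-backed ring.

```python
from collections import deque

# Hypothetical levels considered valuable enough to persist.
PERSIST_LEVELS = {"error", "state", "reset"}


class LayeredLog:
    def __init__(self, ring_size, min_persist_interval):
        self.ring = deque(maxlen=ring_size)  # all events, memory only
        self.persisted = []  # stand-in for a flash-backed ring
        self.suppressed = 0  # persistent writes skipped by the rate limit
        self._interval = min_persist_interval
        self._last_persist = None

    def log(self, now, level, message):
        self.ring.append((now, level, message))
        if level not in PERSIST_LEVELS:
            return
        too_soon = (
            self._last_persist is not None
            and now - self._last_persist < self._interval
        )
        if too_soon:
            self.suppressed += 1  # still visible in the memory ring
            return
        # Record how many writes were skipped since the last persisted entry.
        self.persisted.append((now, level, message, self.suppressed))
        self.suppressed = 0
        self._last_persist = now
```

The rate limit trades completeness for flash lifetime: a burst of identical errors costs one write plus a suppressed count instead of one write per line. Whether state transitions should ever be rate-limited is a policy decision this sketch does not settle.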
The Next Boot Must Handle Unfinished State
Crash evidence is not only for later analysis. It also supports recovery on the next boot.
At boot, the system should check:
- whether the previous shutdown was clean
- whether panic, watchdog, or brown-out markers exist
- whether an upgrade was unfinished
- whether configuration is in a temporary or commit-in-progress state
- whether database or logs need recovery
- whether crash evidence has not been uploaded
- whether to enter degraded mode or roll back
If the system records crashes but then continues as normal, it may overwrite evidence or keep operating on damaged state.
Reliable systems often include “was the last exit clean?” in their boot flow. A normal shutdown or reboot writes a clean marker; an unexpected reset without that marker enters recovery and reporting.
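The clean-marker pattern can be sketched in a few lines. The marker path, function names, and action labels below are hypothetical. The important detail is that the marker is consumed at boot, so every run starts in the "not clean" state and only an orderly shutdown puts the marker back.

```python
import os


def mark_clean_shutdown(marker_path):
    """Called at the end of an orderly shutdown or reboot."""
    with open(marker_path, "w") as handle:
        handle.write("clean\n")
        handle.flush()
        os.fsync(handle.fileno())  # make sure the marker reaches storage


def consume_clean_marker(marker_path):
    """Return True if the last exit was clean, and remove the marker."""
    try:
        os.remove(marker_path)
        return True
    except FileNotFoundError:
        return False


def boot_actions(clean, upgrade_pending, evidence_pending):
    """Decide what the boot flow should do before starting normal services."""
    actions = []
    if not clean:
        actions.append("enter_recovery")
    if upgrade_pending:
        actions.append("resume_or_rollback_upgrade")
    if evidence_pending:
        actions.append("upload_evidence")
    return actions or ["normal_boot"]
```

A boot script would call consume_clean_marker() early, before any service that might rotate logs or rewrite state, and feed the result into boot_actions().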
Privacy and Security Matter
Crash evidence can contain sensitive data.
Core dumps, request parameters, network logs, configuration files, key paths, and user data may end up in evidence. If devices upload these artifacts, sanitization, access control, and retention policy are required.
Common practices include:
- upload only the minimal necessary fields
- filter user data and secrets
- symbolize by build ID on the server instead of uploading full binaries
- protect evidence files with permissions and encryption
- limit log retention time and size
The more detailed the evidence, the more important the boundary. Debugging must not leak sensitive data.
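As a sketch of the first two practices, the function below uploads only an allowlisted set of fields and redacts obvious secrets inside text values. The field names and the regular expression are assumptions for this example; a real policy would be driven by the device's own data model and would need review, since pattern-based redaction can miss secrets it does not anticipate.

```python
import re

# Hypothetical allowlist: anything not named here is never uploaded.
ALLOWED_FIELDS = ("failure_class", "build_id", "signal", "stack", "uptime_s", "message")

# Redact values of key=value pairs whose key looks like a credential.
SECRET_PATTERN = re.compile(r"\b(password|token|secret|api_key)=\S+", re.IGNORECASE)


def sanitize(record):
    """Return a copy of the record reduced to allowed, redacted fields."""
    clean = {}
    for key in ALLOWED_FIELDS:
        if key not in record:
            continue
        value = record[key]
        if isinstance(value, str):
            value = SECRET_PATTERN.sub(lambda m: m.group(1) + "=<redacted>", value)
        clean[key] = value
    return clean
```

An allowlist is the safer default in this sketch: a new field added to the crash record stays on the device until someone decides it is safe to upload.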
How to Debug Reboots and Crashes
When a device “rebooted,” “hung,” or “went offline,” walk the evidence chain.
First, check the reset reason: power-on, software reset, watchdog, brown-out, and reboot after panic each point to a different investigation path.
Second, check whether the previous exit was clean. Look for clean markers, unfinished transactions, and unfinished upgrades.
Third, check application crash evidence: core, minidump, signal, stack, and version.
Fourth, check kernel evidence: panic/oops, pstore, ramoops, and kernel log tail.
Fifth, check watchdog evidence: last heartbeat, task state, queue length, and last I/O progress.
Sixth, check whether logs can be correlated by time and request. Is there context, or only isolated error codes?
Seventh, check whether boot flow overwrote the scene. Startup scripts, log rotation, and service restart may erase critical evidence.
These questions turn “occasional reboot” into a traceable evidence path.
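The same walk can be encoded so that every incident report is checked in the same order. The sketch below uses hypothetical step keys that mirror the seven questions above and assumes the collected evidence arrives as a dictionary, with None or a missing key meaning "nothing was captured for this step".

```python
# Hypothetical step keys, in the order the questions above are asked.
TRIAGE_ORDER = (
    "reset_reason",
    "clean_exit",
    "app_crash",
    "kernel_log",
    "watchdog_progress",
    "log_correlation",
    "boot_overwrite_check",
)


def triage_report(evidence):
    """Return (step, finding) pairs in triage order, marking absent steps."""
    report = []
    for step in TRIAGE_ORDER:
        value = evidence.get(step)
        report.append((step, "missing" if value is None else value))
    return report


def first_gap(evidence):
    """Return the first step with no evidence, or None if all are covered."""
    for step in TRIAGE_ORDER:
        if evidence.get(step) is None:
            return step
    return None
```

Tracking first_gap() across many field reports shows which link in the evidence chain the device most often fails to capture.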
What Matters in Practice
Crash evidence is not about having more logs.
Useful evidence answers four questions:
- why the system reset or crashed
- where execution was when it failed
- what the key system state was
- whether the next boot should recover, roll back, or report
Application crashes rely on stack and version. Kernel crashes rely on panic/oops and persistent kernel logs. Watchdog resets rely on last progress. Power loss and brown-out rely on reset reason, filesystem recovery records, and business commit markers.
Devices fail when no engineer has a debugger attached. Whether an unattended device can leave small, reliable, explainable evidence often decides whether a failure can be fixed or merely labeled “occasional.”