What Crash Evidence and Logs Should Preserve

The hardest field failures are often not “the device crashed,” but “the device crashed, rebooted, and left nothing useful behind.”

After reboot, everything may look normal. Services restart, the network reconnects, and logs begin from the new boot. Users only know the device was offline. Engineers are left guessing: was it an application crash, a kernel panic, a watchdog reset, a power loss, a voltage dip, or an external MCU resetting the main processor?

The goal of crash evidence is not to save every log line. Useful evidence should be small enough, reliable enough, and specific enough to let the next boot identify the failure type, failure location, system state, and recovery path.

A useful first model is:

record key state while running
-> failure happens
-> preserve minimal evidence if possible
-> system resets or service restarts
-> next boot reads evidence
-> classify, report, recover, or roll back

Crash evidence is not one log file. It is an evidence chain from runtime, through failure, into the next boot.

First Separate Failure Types

“Crash” is too broad. Different failures need different evidence.

Common categories include:

application process crash, such as segmentation fault, abort, or uncaught exception
kernel oops or panic
watchdog reset
power loss or brown-out
CPU runaway or hard fault
process killed by OOM
service manager restarting a service intentionally
external MCU or power-management chip asserting reset

Without this classification, logs can be large but still fail to indicate where to look.

Application crashes need call stacks, signals, versions, and input context. Kernel crashes need panic/oops text, registers, kernel stack, and driver state. Watchdog resets need last heartbeats, task state, and last feed time. Power failures need power-event records, filesystem recovery logs, and unfinished write state.

The first item in minimal crash evidence should be the failure class.
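
As a minimal sketch of that idea, the record below puts the failure class first and protects the whole record with a checksum, so the next boot can tell a valid record from leftover garbage. The class names, field layout, and magic value are assumptions made for illustration, not a standard format.

```python
import enum
import struct
import zlib


class FailureClass(enum.IntEnum):
    """Hypothetical failure classes, mirroring the categories listed above."""
    UNKNOWN = 0
    APP_CRASH = 1
    KERNEL_PANIC = 2
    WATCHDOG = 3
    BROWN_OUT = 4
    HARD_FAULT = 5
    OOM_KILL = 6
    SERVICE_RESTART = 7
    EXTERNAL_RESET = 8


# Assumed layout: magic, failure class, boot counter, uptime in ms, then CRC32.
_BODY = struct.Struct("<IBIQ")
_MAGIC = 0x43524153  # arbitrary marker chosen for this sketch


def pack_record(failure: FailureClass, boot_count: int, uptime_ms: int) -> bytes:
    """Serialize one compact evidence record with a trailing CRC32."""
    body = _BODY.pack(_MAGIC, int(failure), boot_count, uptime_ms)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_record(raw: bytes):
    """Return (failure, boot_count, uptime_ms), or None if the record is invalid."""
    if len(raw) != _BODY.size + 4:
        return None
    body, (crc,) = raw[:_BODY.size], struct.unpack("<I", raw[_BODY.size:])
    if zlib.crc32(body) != crc:
        return None
    magic, failure, boot_count, uptime_ms = _BODY.unpack(body)
    if magic != _MAGIC:
        return None
    try:
        return FailureClass(failure), boot_count, uptime_ms
    except ValueError:
        # A class written by a newer firmware that this reader does not know.
        return FailureClass.UNKNOWN, boot_count, uptime_ms
```

A record this small (21 bytes in this layout) can be written from a crash handler or kept in retained memory, and a corrupted copy is rejected instead of misleading the investigation.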

Reset Reason Is the First Clue After Reboot

Embedded devices often provide reset reason or reset cause through the SoC, PMU, RTC, external power chip, or bootloader.

Common causes include:

power-on reset
software reset
watchdog reset
brown-out or low-voltage reset
external reset pin
reboot after panic
low-power wakeup

Reading and saving the reset reason early in boot matters. Many reset-cause registers are cleared or overwritten by later initialization. Read too late, and the clue is gone.

Reset reason is not the final answer, but it narrows the search. A watchdog reset and a power-on reset should not lead to the same investigation. Brown-out and software reboot are also very different.

If hardware reset reason is unreliable, software evidence should supplement it: whether the last shutdown was clean, whether a panic marker exists, and whether watchdog evidence was left unreported.
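
The sketch below shows one way to decode a latched reset-cause value and supplement it with software markers. The bit positions are hypothetical (real ones come from the SoC or PMU datasheet), and `clean_shutdown` and `panic_marker` are assumed to be read from wherever the previous boot stored them.

```python
# Hypothetical bit layout; real positions come from the SoC or PMU datasheet.
RESET_BITS = {
    0: "power-on reset",
    1: "software reset",
    2: "watchdog reset",
    3: "brown-out reset",
    4: "external reset pin",
}


def decode_reset_cause(raw: int) -> list:
    """Return every cause whose bit is set; several bits may be latched at once."""
    causes = [name for bit, name in RESET_BITS.items() if raw & (1 << bit)]
    return causes or ["unknown"]


def describe_last_boot(raw: int, clean_shutdown: bool, panic_marker: bool) -> str:
    """Combine the hardware cause with software markers from the previous boot."""
    causes = decode_reset_cause(raw)
    if panic_marker:
        causes.append("reboot after panic")
    if not clean_shutdown and causes == ["software reset"]:
        # A software reset nobody recorded as intentional deserves a closer look.
        causes.append("no clean-shutdown marker")
    return ", ".join(causes)
```

For example, `describe_last_boot(0b00010, clean_shutdown=False, panic_marker=True)` reports a software reset that followed a panic, which hardware alone would show only as a software reset.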

Application Crashes Need Stack and Version

For application process crashes, the most useful evidence is usually:

process name and pid
signal or exception type
fault address
call stack
thread list
registers or minimal CPU context
program version, build ID, and symbol information
recent business events
input parameters or request ID

On Linux, these can come from core dumps, minidumps, crash handlers, or service managers. Space-constrained devices may not be able to store a full core, but should at least preserve enough information to symbolize the stack.

A stack trace without the matching version and symbols loses much of its value. Production devices must be able to map crash addresses back to the exact binary build. Otherwise an address is just a number.

For C and C++ programs, segmentation faults, use-after-free, stack overflow, null pointers, and out-of-bounds writes may all end as process crashes. Without stack and version, the scene is hard to reconstruct.
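
To illustrate why the build matters, here is a small sketch of server-side symbolization keyed by build ID. The build IDs, symbol tables, and offsets are invented for the example; a real symbol store would be generated from the unstripped binary of each release.

```python
import bisect

# Hypothetical symbol store keyed by build ID: (start offset, function name),
# sorted by start offset.
SYMBOLS = {
    "a1b2c3": [(0x1000, "main"), (0x1400, "parse_request"), (0x1A00, "send_reply")],
    "d4e5f6": [(0x1000, "main"), (0x1200, "parse_request"), (0x1420, "send_reply")],
}


def symbolize(build_id: str, offset: int) -> str:
    """Map an offset within the binary to 'function+0xNN' for one exact build."""
    table = SYMBOLS.get(build_id)
    if table is None:
        return f"<unknown build {build_id}>+{offset:#x}"
    starts = [start for start, _ in table]
    # Find the last symbol that starts at or before the offset.
    i = bisect.bisect_right(starts, offset) - 1
    if i < 0:
        return f"<below first symbol>+{offset:#x}"
    start, name = table[i]
    return f"{name}+{offset - start:#x}"
```

With these tables, offset `0x1450` resolves to `parse_request+0x50` in build `a1b2c3` but to `send_reply+0x30` in build `d4e5f6`. The same number points at different code depending on the build.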

Kernel Crashes Need panic/oops Evidence

A kernel oops or panic is more dangerous than an application crash because it can compromise the whole system.

Useful evidence includes:

panic/oops text
current CPU and process context
registers
kernel call stack
taint flags
loaded modules
recent kernel logs
device driver state
whether execution was in interrupt, softirq, or workqueue context

The problem is that after a kernel crash, the filesystem may not be safe to write. Ordinary logs may not survive. Linux systems often use pstore, ramoops, kdump, netconsole, or serial logs to store crash information in reserved memory, special storage, or an external machine.

On embedded systems, ramoops is common: reserve a small memory region whose contents survive a warm reset, let the kernel write the panic text and console tail into it, and read them back after reboot.

The point is not to save every log. It is to preserve the last high-value kernel evidence before the crash.
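
On Linux, pstore records are typically exposed as files under `/sys/fs/pstore` once that filesystem is mounted, with names such as `dmesg-ramoops-0`; the exact names depend on the backend and kernel configuration. The sketch below copies whatever records exist into an evidence directory early in boot. The function name and the destination path are assumptions.

```python
import shutil
from pathlib import Path


def collect_pstore(src_dir: str, dst_dir: str) -> list:
    """Copy previous-boot records out of a pstore mount into dst_dir.

    Returns the names of the copied files, sorted. Source files are left in
    place; deleting them to free the backing store is a separate decision.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        return []
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file():
            shutil.copyfile(entry, dst / entry.name)
            copied.append(entry.name)
    return copied
```

A boot script might call `collect_pstore("/sys/fs/pstore", "/var/lib/evidence/boot-123")`, where the destination is a hypothetical per-boot directory, before anything else has a chance to clear the records.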

Watchdog Resets Need Last Progress

A watchdog reset often has no ordinary crash stack. The system may simply stop making progress and then be reset by hardware.

The most useful evidence is last progress:

last successful watchdog feed time
last heartbeat for each task or thread
key queue lengths
current state-machine state
last successful I/O
current lock or resource wait
whether the system was in upgrade, flash erase, low-power wake, or another long path
watchdog trigger count and intervals

These fields cannot wait until the timeout moment. A hardware reset may leave no execution opportunity.

A more reliable design updates a retained area or persistent ring record at low cost during normal operation. Each update writes only a few critical fields, and the next boot reads them.

For watchdog evidence, the key question is not whether a reset happened but where progress stopped.
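
A minimal sketch of such a progress table follows. The `bytearray` stands in for retained RAM or a small persistent region, and the slot layout, task IDs, and state codes are all hypothetical.

```python
import struct

# Assumed slot layout: task id, state id, last heartbeat in ms of uptime.
_SLOT = struct.Struct("<HHQ")
MAX_TASKS = 8  # arbitrary size for this sketch


class ProgressTable:
    """Fixed-size progress records kept in a buffer that survives reset."""

    def __init__(self, buf=None):
        self.buf = buf if buf is not None else bytearray(_SLOT.size * MAX_TASKS)

    def beat(self, task_id: int, state: int, uptime_ms: int) -> None:
        # One small in-place write per heartbeat, cheap enough for normal operation.
        _SLOT.pack_into(self.buf, task_id * _SLOT.size, task_id, state, uptime_ms)

    def stalest(self, active_tasks):
        """Return (task_id, state, last_beat_ms) for the task with the oldest heartbeat."""
        rows = [_SLOT.unpack_from(self.buf, t * _SLOT.size) for t in active_tasks]
        return min(rows, key=lambda row: row[2])
```

After a watchdog reset, the next boot wraps the surviving buffer in a new `ProgressTable` and calls `stalest(...)` to see which task went quiet first and in which state.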

Logs Need Time, Order, and Context

Many logs are large but unusable because they lack ordering and context.

Useful logs should include:

monotonic time or uptime timestamp
log level
module name
thread, task, or pid
key object ID, such as connection, request, device, or transaction
error code
state transition
version information

Wall-clock time may be wrong before NTP sync, so device logs should preserve monotonic time since boot. Once real time is known, it can be correlated with monotonic time.

Logs should also tie each event to a business path. Writing only “failed” is not enough. It is better to know which request, connection, state machine, or device instance failed.
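
The sketch below shows one possible line format that carries monotonic time and context, plus the correlation step once real time is known. The field order and the key names (`conn`, `err`, `req`) are assumptions, and `time.monotonic()` stands in for the device's uptime clock.

```python
import time


def format_log(uptime_s: float, level: str, module: str, task: str, msg: str, **ctx) -> str:
    """Render one line: monotonic time, level, module, task, message, sorted context."""
    fields = " ".join(f"{k}={v}" for k, v in sorted(ctx.items()))
    return f"[{uptime_s:12.3f}] {level:<5} {module} {task} {msg} {fields}".rstrip()


def to_wall_clock(uptime_s: float, sync_uptime_s: float, sync_wall_s: float) -> float:
    """Convert a monotonic timestamp to wall time using one (uptime, wall) sync pair."""
    return sync_wall_s + (uptime_s - sync_uptime_s)


# Typical call site: the context says which connection and request failed.
line = format_log(time.monotonic(), "ERROR", "net", "rx-task", "send failed",
                  conn=7, err=-32, req="r-1042")
```

Because the sync pair is applied after the fact, events logged before NTP sync can still be placed on the real timeline, as long as they belong to the same boot.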

Ring Logs Fit Field Evidence Better Than Infinite Append

Device storage is limited, and flash has finite write endurance. Logs that are appended without bound are not practical and may create new failures.

Field evidence is often better as rings:

in-memory ring buffer for recent high-frequency events
persistent ring for key state changes
freeze the tail on crash
upload unreported evidence first after boot

The value of ring logs is preserving what happened just before failure. For many bugs, the last few hundred key events are more useful than complete logs from days ago.

But write cost matters. High-frequency flash writes cause wear and performance problems. Keep ordinary debug logs in memory, and persist low-frequency high-value facts such as state transitions, error codes, reset reasons, and version.

Not Every Log Should Be Synchronous

To avoid losing logs during crashes, some systems try to synchronously write every log line to storage. This is usually expensive.

Synchronous logging can cause:

higher I/O latency
flash write amplification
storage wear
more complicated power-failure windows
logs slowing or destabilizing the business path

A better design is layered:

high-frequency debug logs in an in-memory ring buffer
key errors and state changes persisted at low frequency
crash evidence in retained memory, a special partition, or compact records
business data and logs using different persistence policies

The logging system should not become a new source of instability.
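
The layering can be sketched as a small router: every line goes to a bounded in-memory buffer, and only selected levels reach the persistent sink. The level names and the `persist` callable are assumptions for this example.

```python
from collections import deque


class LayeredLog:
    """Route events by value: debug stays in RAM, key facts go to a persistent sink."""

    PERSIST_LEVELS = {"ERROR", "STATE"}  # assumed policy for this sketch

    def __init__(self, persist, ram_lines: int = 256):
        self.ram = deque(maxlen=ram_lines)  # oldest lines fall off automatically
        self.persist = persist              # callable that writes one compact record

    def log(self, level: str, msg: str) -> None:
        self.ram.append((level, msg))
        if level in self.PERSIST_LEVELS:
            self.persist((level, msg))

    def freeze_tail(self, count: int = 32) -> list:
        """Snapshot the newest lines, for example from a crash handler."""
        return list(self.ram)[-count:]
```

With this split, a burst of debug output costs only memory writes, while the storage path sees a trickle of state changes and errors.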

The Next Boot Must Handle Unfinished State

Crash evidence is not only for later analysis. It also supports recovery on the next boot.

At boot, the system should check:

whether the previous shutdown was clean
whether panic, watchdog, or brown-out markers exist
whether an upgrade was unfinished
whether configuration is in a temporary or commit-in-progress state
whether database or logs need recovery
whether crash evidence has not been uploaded
whether to enter degraded mode or roll back

If the system records crashes but then continues as normal, it may overwrite evidence or keep operating on damaged state.

Reliable systems often include “was the last exit clean?” in their boot flow. A normal shutdown or reboot writes a clean marker; an unexpected reset without that marker enters recovery and reporting.
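
A file-based version of the clean marker might look like the sketch below. The marker path is whatever location the system chooses, the function names are hypothetical, and a real implementation may also need to sync the containing directory, depending on the filesystem.

```python
import os


def mark_clean_shutdown(marker_path: str) -> None:
    """Called at the end of an orderly shutdown or reboot."""
    with open(marker_path, "w") as f:
        f.write("clean\n")
        f.flush()
        os.fsync(f.fileno())  # push the marker to storage before power-off


def previous_exit_was_clean(marker_path: str) -> bool:
    """Called early in boot. Consumes the marker so it only vouches for one shutdown."""
    try:
        os.remove(marker_path)
    except FileNotFoundError:
        return False
    return True
```

Removing the marker at boot is the important design choice: if the device then resets unexpectedly, the next boot finds no marker and enters recovery and reporting.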

Privacy and Security Matter

Crash evidence can contain sensitive data.

Core dumps, request parameters, network logs, configuration files, key paths, and user data may end up in evidence. If devices upload these artifacts, sanitization, access control, and retention policy are required.

Common practices include:

upload only the minimal necessary fields
filter user data and secrets
symbolize by build ID on the server instead of uploading full binaries
protect evidence files with permissions and encryption
limit log retention time and size

The more detailed the evidence, the more important the boundary. Debugging must not leak sensitive data.
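
As a sketch of the first two practices, the function below keeps only allowlisted top-level fields and redacts secret-looking keys nested inside them. The field names and key names are illustrative assumptions; a real policy would be defined per product.

```python
ALLOWED_FIELDS = {"failure_class", "build_id", "signal", "stack", "uptime_ms", "request"}
SECRET_KEYS = {"password", "token", "api_key", "authorization"}


def sanitize(report: dict) -> dict:
    """Keep only allowlisted top-level fields and redact secret-looking keys below them."""
    def scrub(value):
        if isinstance(value, dict):
            return {
                k: "<redacted>" if k.lower() in SECRET_KEYS else scrub(v)
                for k, v in value.items()
            }
        if isinstance(value, list):
            return [scrub(v) for v in value]
        return value

    return {k: scrub(v) for k, v in report.items() if k in ALLOWED_FIELDS}
```

An allowlist fails safe: a field added to the report later is not uploaded until someone decides that it should be.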

How to Debug Reboots and Crashes

When a device “rebooted,” “hung,” or “went offline,” walk the evidence chain.

First, check reset reason: power-on, software reset, watchdog, brown-out, or reboot after panic point to different paths.

Second, check whether the previous exit was clean. Look for clean markers, unfinished transactions, and unfinished upgrades.

Third, check application crash evidence: core, minidump, signal, stack, and version.

Fourth, check kernel evidence: panic/oops, pstore, ramoops, and kernel log tail.

Fifth, check watchdog evidence: last heartbeat, task state, queue length, and last I/O progress.

Sixth, check whether logs can be correlated by time and request. Is there context, or only isolated error codes?

Seventh, check whether boot flow overwrote the scene. Startup scripts, log rotation, and service restart may erase critical evidence.

These questions turn “occasional reboot” into a traceable evidence path.
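
The walk can also be automated as a first-pass triage at boot. The sketch below is one possible ordering, not a definitive one, and every dictionary key is a hypothetical name for a fact gathered from the evidence sources above.

```python
def triage(evidence: dict) -> str:
    """Walk the checks in order and return the first conclusion the evidence supports."""
    reset = evidence.get("reset_reason", "unknown")
    if reset == "brown-out":
        return "power: inspect supply, filesystem recovery, and commit markers"
    if evidence.get("kernel_panic_log"):
        return "kernel: read panic/oops text and the kernel log tail"
    if reset == "watchdog":
        task = evidence.get("stalest_task", "unknown task")
        return f"watchdog: progress stopped in {task}"
    if evidence.get("app_crash_report"):
        return "application: symbolize the stack against the recorded build"
    if evidence.get("clean_shutdown"):
        return "intentional: clean shutdown or requested reboot"
    return "unclassified: unclean exit with no evidence, check what boot may have overwritten"
```

For instance, `triage({"reset_reason": "watchdog", "stalest_task": "modem-rx"})` points the investigation at the task that stopped making progress.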

What Matters in Practice

Crash evidence is not about having more logs.

Useful evidence answers four questions:

why the system reset or crashed
where execution was when it failed
what the key system state was
whether the next boot should recover, roll back, or report

Application crashes rely on stack and version. Kernel crashes rely on panic/oops and persistent kernel logs. Watchdog resets rely on last progress. Power loss and brown-out rely on reset reason, filesystem recovery records, and business commit markers.

Devices fail when no engineer has a debugger attached. Whether an unattended device can leave small, reliable, explainable evidence often decides whether a failure can be fixed or merely labeled “occasional.”