Skip to main content Why Linux Driver Debugging Is More Than printk | IoT Worker

Why Linux Driver Debugging Is More Than printk

The first tool many driver developers reach for is printk. Probe does not run, print a line. Interrupts do not arrive, print a line. DMA does not move, print a line. User space cannot read data, print another line.

That works, but only up to a point:

  • too many logs hide the real failure
  • high-frequency logs slow the system down
  • printing in interrupt or locked paths changes timing
  • production images cannot keep verbose logs enabled
  • an intermittent bug disappears after adding logs
  • multi-instance devices are hard to distinguish

A better model is layered instrumentation:

dev_err/dev_warn/dev_info: record failures and key states
dev_dbg + dynamic debug: enable detailed logs only when needed
tracepoint/ftrace: inspect timing, call paths, and latency
rate limit: prevent logs from damaging the system
debugfs/sysfs: expose necessary state snapshots

Logs are good at answering “what happened”. Tracing is better at answering “when did it happen, which path ran, and where was the latency”.

printk Has Weak Device Context

Raw printk lacks context. With multiple boards, instances, and buses, a log line may not tell which device produced it.

Drivers should usually use device-aware logging:

dev_err(dev, "failed to enable clock: %d\n", ret);
dev_warn(dev, "using fallback configuration\n");
dev_info(dev, "firmware version %u\n", version);
dev_dbg(dev, "irq status=0x%x\n", status);

dev_* logs include device context, making it easier to connect the line to a sysfs device, driver instance, and hardware block.

A basic convention is:

  • dev_err: operation failed, feature unavailable, or state invalid
  • dev_warn: system can continue but has degraded or unexpected state
  • dev_info: low-frequency key state, such as version, mode, or successful init
  • dev_dbg: detailed information for development and diagnosis

Do not print everything as an error. Once error level is abused, real failures are buried.

Log State Transitions

Useful logs are not about quantity. They should sit at state transitions.

Good places include:

  • probe start and failure cause
  • resource acquisition failures, such as clock, regulator, GPIO, IRQ, DMA
  • hardware reset, power-up, and mode changes
  • abnormal interrupt status
  • DMA submission and completion errors
  • runtime PM suspend/resume
  • remove and error rollback
  • important user-space open/read/write/ioctl errors

Poor long-term logs include:

  • every polling iteration
  • every interrupt
  • every packet
  • context-free “here 1” and “here 2”
  • repeated success messages on hot paths

Logs should show where the state machine went wrong, not turn the system into a text stream.

dynamic debug Makes dev_dbg Controllable

dev_dbg is usually not always printed. dynamic debug can enable debug logs at runtime by file, function, module, or format string.

The common control file is:

/sys/kernel/debug/dynamic_debug/control

Examples:

echo 'module my_driver +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file drivers/foo/bar.c +p' > /sys/kernel/debug/dynamic_debug/control

The value is that developers can leave detailed dev_dbg calls in the driver while production stays quiet by default. Field debugging can enable one module, collect detail, and turn it off.

This is better than adding temporary printk because:

  • no kernel rebuild is required
  • one file or function can be targeted
  • details can be collected in the field and disabled again
  • debug logs do not need to be always on

The prerequisite is that the driver already uses dev_dbg for fine-grained detail instead of scattered raw printk.

Rate Limiting Prevents Log Damage

Some driver errors happen on hot paths: interrupt storms, bus errors, DMA timeouts, repeated resets.

Printing every occurrence creates new problems:

  • ring buffer is filled
  • CPU is consumed
  • interrupt and workqueue handling slow down
  • real-time behavior changes
  • the first useful evidence is overwritten
  • persistent logging stresses storage

Linux provides rate-limited logging helpers:

dev_err_ratelimited(dev, "rx timeout\n");
dev_warn_ratelimited(dev, "bad status=0x%x\n", status);

For repeated high-frequency failures, limit the print rate and keep counters for later inspection.

More logs are not always better. For high-frequency errors, the first event, final state, accumulated count, and trigger condition are often more useful than every repeated line.

Tracing Is Better for Timing

Many driver failures are about ordering:

  • interrupt arrives before buffers are ready
  • runtime suspend powers off just before user access
  • workqueue processing is too slow
  • DMA completion races with close
  • probe deferral makes services start before devices exist
  • lock contention causes intermittent timeout

Logs are weak here. Timestamps include print overhead, and too many logs change the timing.

ftrace, tracepoints, function graph tracing, and perf are better tools for paths and latency.

A practical sequence is:

use logs to confirm state
then use trace to inspect timing
then keep counters or debugfs/sysfs entries for recurring diagnosis

Tracing is friendlier to hot paths and can correlate scheduling, interrupts, softirqs, workqueues, and function calls.

tracepoints Are Observation Points

If an event needs long-term observability, a tracepoint may be better than a temporary log.

Logs are for humans. Tracepoints are for tools. They can carry structured fields, be enabled on demand, and integrate with kernel tracing.

Events that may deserve tracepoints include:

  • request submission and completion
  • DMA buffer lifecycle
  • state machine transitions
  • error retries
  • queue depth changes
  • power state changes

Not every driver needs custom tracepoints. Many subsystems already have useful ones, such as block, networking, scheduler, IRQ, workqueue, and dma_fence. Check existing subsystem tracing first.

Logs Change Timing

One counterintuitive fact: adding logs can make a bug disappear.

Reasons include:

  • printing adds delay and changes race windows
  • console output may take locks or block
  • high-frequency logs change scheduling and cache behavior
  • serial output is slow
  • print paths may add memory allocation or lock contention
  • log level and console configuration change actual cost

For intermittent races, timeouts, interrupt ordering, or DMA completion order, do not rely only on adding logs.

Better options are:

  • lightweight counters
  • tracing for event timing
  • bounded ring buffers for recent events
  • avoiding prints in hard IRQ and locked hot paths
  • enabling dynamic debug precisely instead of globally

Logging is an observation tool, not a zero-cost record of truth.

Debugging Facilities Need Lifecycle Handling

Driver debug entries must match the device lifecycle:

  • probe failure should not leave half-created debugfs nodes
  • remove should make old nodes disappear safely
  • register reads during runtime suspend should resume the device or fail clearly
  • suspend/resume should not race with debug access
  • error-path logs should show which cleanup step failed

If a debugfs file dumps registers, reading it must consider whether the device is powered, whether clocks are enabled, and whether register access is safe. Otherwise the debugging interface can create faults.

Instrumentation is part of the driver. It should not bypass resources and power management.

A Practical Debugging Order

For driver failures, use this sequence:

start with dev_err/dev_warn: is there a clear failure
-> inspect probe resources and return codes
-> enable dynamic debug for details
-> read dmesg timestamps and device context
-> use trace/ftrace for timing and call paths
-> use rate limits and counters for hot errors
-> move recurring diagnostic state into debugfs or a stable ABI

Useful entry points include:

dmesg -T
journalctl -k -b
cat /sys/kernel/debug/dynamic_debug/control
echo 'module my_driver +p' > /sys/kernel/debug/dynamic_debug/control
trace-cmd record -e irq -e workqueue
cat /sys/kernel/tracing/trace

Actual availability depends on kernel configuration, rootfs, and permissions. Production devices should still keep enough error logs and diagnostic entry points to locate critical failures.

Driver Debugging Is Observation Design

printk is useful, but it is not a debugging system.

dev_err/dev_warn/dev_info preserve key state. dev_dbg plus dynamic debug provides on-demand detail. trace and ftrace show timing and paths. Rate limiting prevents logs from damaging the system. debugfs and sysfs expose necessary state.

The closer a driver is to interrupts, DMA, power management, and concurrency, the less it can rely on simply printing more lines. Stable debugging comes from planned observation points, not from flooding the system with logs in the field.