Why Linux Driver Debugging Is More Than printk

Reading time: 6 minute Word count: 1244

The first tool many driver developers reach for is printk. Probe does not run, print a line. Interrupts do not arrive, print a line. DMA does not move, print a line. User space cannot read data, print another line.

That works, but only up to a point:

too many logs hide the real failure
high-frequency logs slow the system down
printing in interrupt or locked paths changes timing
production images cannot keep verbose logs enabled
an intermittent bug disappears after adding logs
multi-instance devices are hard to distinguish

A better model is layered instrumentation:

dev_err/dev_warn/dev_info: record failures and key states
dev_dbg + dynamic debug: enable detailed logs only when needed
tracepoint/ftrace: inspect timing, call paths, and latency
rate limit: prevent logs from damaging the system
debugfs/sysfs: expose necessary state snapshots

Logs are good at answering “what happened”. Tracing is better at answering “when did it happen, which path ran, and where was the latency”.

`printk` Has Weak Device Context

Raw printk lacks context. With multiple boards, instances, and buses, a log line may not tell which device produced it.

Drivers should usually use device-aware logging:

dev_err(dev, "failed to enable clock: %d\n", ret);
dev_warn(dev, "using fallback configuration\n");
dev_info(dev, "firmware version %u\n", version);
dev_dbg(dev, "irq status=0x%x\n", status);

dev_* logs include device context, making it easier to connect the line to a sysfs device, driver instance, and hardware block.

A basic convention is:

dev_err: operation failed, feature unavailable, or state invalid
dev_warn: system can continue but has degraded or unexpected state
dev_info: low-frequency key state, such as version, mode, or successful init
dev_dbg: detailed information for development and diagnosis

Do not print everything as an error. Once error level is abused, real failures are buried.

Log State Transitions

Useful logs are not about quantity. They should sit at state transitions.

Good places include:

probe start and failure cause
resource acquisition failures, such as clock, regulator, GPIO, IRQ, DMA
hardware reset, power-up, and mode changes
abnormal interrupt status
DMA submission and completion errors
runtime PM suspend/resume
remove and error rollback
important user-space open/read/write/ioctl errors

Poor long-term logs include:

every polling iteration
every interrupt
every packet
context-free “here 1” and “here 2”
repeated success messages on hot paths

Logs should show where the state machine went wrong, not turn the system into a text stream.

dynamic debug Makes `dev_dbg` Controllable

dev_dbg is usually not always printed. dynamic debug can enable debug logs at runtime by file, function, module, or format string.

The common control file is:

/sys/kernel/debug/dynamic_debug/control

Examples:

echo 'module my_driver +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file drivers/foo/bar.c +p' > /sys/kernel/debug/dynamic_debug/control

The value is that developers can leave detailed dev_dbg calls in the driver while production stays quiet by default. Field debugging can enable one module, collect detail, and turn it off.

This is better than adding temporary printk because:

no kernel rebuild is required
one file or function can be targeted
details can be collected in the field and disabled again
debug logs do not need to be always on

The prerequisite is that the driver already uses dev_dbg for fine-grained detail instead of scattered raw printk.

Rate Limiting Prevents Log Damage

Some driver errors happen on hot paths: interrupt storms, bus errors, DMA timeouts, repeated resets.

Printing every occurrence creates new problems:

ring buffer is filled
CPU is consumed
interrupt and workqueue handling slow down
real-time behavior changes
the first useful evidence is overwritten
persistent logging stresses storage

Linux provides rate-limited logging helpers:

dev_err_ratelimited(dev, "rx timeout\n");
dev_warn_ratelimited(dev, "bad status=0x%x\n", status);

For repeated high-frequency failures, limit the print rate and keep counters for later inspection.

More logs are not always better. For high-frequency errors, the first event, final state, accumulated count, and trigger condition are often more useful than every repeated line.

Tracing Is Better for Timing

Many driver failures are about ordering:

interrupt arrives before buffers are ready
runtime suspend powers off just before user access
workqueue processing is too slow
DMA completion races with close
probe deferral makes services start before devices exist
lock contention causes intermittent timeout

Logs are weak here. Timestamps include print overhead, and too many logs change the timing.

ftrace, tracepoints, function graph tracing, and perf are better tools for paths and latency.

A practical sequence is:

use logs to confirm state
then use trace to inspect timing
then keep counters or debugfs/sysfs entries for recurring diagnosis

Tracing is friendlier to hot paths and can correlate scheduling, interrupts, softirqs, workqueues, and function calls.

tracepoints Are Observation Points

If an event needs long-term observability, a tracepoint may be better than a temporary log.

Logs are for humans. Tracepoints are for tools. They can carry structured fields, be enabled on demand, and integrate with kernel tracing.

Events that may deserve tracepoints include:

request submission and completion
DMA buffer lifecycle
state machine transitions
error retries
queue depth changes
power state changes

Not every driver needs custom tracepoints. Many subsystems already have useful ones, such as block, networking, scheduler, IRQ, workqueue, and dma_fence. Check existing subsystem tracing first.

Logs Change Timing

One counterintuitive fact: adding logs can make a bug disappear.

Reasons include:

printing adds delay and changes race windows
console output may take locks or block
high-frequency logs change scheduling and cache behavior
serial output is slow
print paths may add memory allocation or lock contention
log level and console configuration change actual cost

For intermittent races, timeouts, interrupt ordering, or DMA completion order, do not rely only on adding logs.

Better options are:

lightweight counters
tracing for event timing
bounded ring buffers for recent events
avoiding prints in hard IRQ and locked hot paths
enabling dynamic debug precisely instead of globally

Logging is an observation tool, not a zero-cost record of truth.

Debugging Facilities Need Lifecycle Handling

Driver debug entries must match the device lifecycle:

probe failure should not leave half-created debugfs nodes
remove should make old nodes disappear safely
register reads during runtime suspend should resume the device or fail clearly
suspend/resume should not race with debug access
error-path logs should show which cleanup step failed

If a debugfs file dumps registers, reading it must consider whether the device is powered, whether clocks are enabled, and whether register access is safe. Otherwise the debugging interface can create faults.

Instrumentation is part of the driver. It should not bypass resources and power management.

A Practical Debugging Order

For driver failures, use this sequence:

start with dev_err/dev_warn: is there a clear failure
-> inspect probe resources and return codes
-> enable dynamic debug for details
-> read dmesg timestamps and device context
-> use trace/ftrace for timing and call paths
-> use rate limits and counters for hot errors
-> move recurring diagnostic state into debugfs or a stable ABI

Useful entry points include:

dmesg -T
journalctl -k -b
cat /sys/kernel/debug/dynamic_debug/control
echo 'module my_driver +p' > /sys/kernel/debug/dynamic_debug/control
trace-cmd record -e irq -e workqueue
cat /sys/kernel/tracing/trace

Actual availability depends on kernel configuration, rootfs, and permissions. Production devices should still keep enough error logs and diagnostic entry points to locate critical failures.

Driver Debugging Is Observation Design

printk is useful, but it is not a debugging system.

dev_err/dev_warn/dev_info preserve key state. dev_dbg plus dynamic debug provides on-demand detail. trace and ftrace show timing and paths. Rate limiting prevents logs from damaging the system. debugfs and sysfs expose necessary state.

The closer a driver is to interrupts, DMA, power management, and concurrency, the less it can rely on simply printing more lines. Stable debugging comes from planned observation points, not from flooding the system with logs in the field.