Driver debugging often focuses on the successful probe path: acquire resources, map registers, request interrupt, initialize hardware, register user-space interface, then print “probe ok”.
Field bugs often hide on other paths:
- probe fails halfway and does not roll back cleanly
- probe returns an error while IRQ or workqueue is still active
- user space keeps an fd after remove
- runtime suspend is powering down while the error path releases resources
- DMA buffer is freed while the device is still writing
- devm_ is used, but object lifetime is not what the driver expected
The hard part is not initializing hardware once. The hard part is: if any step fails, the device is removed, the module unloads, the system suspends, or user space still holds an fd, the driver must stop everything it has already started in the right order.
Think of probe as progressively opening resources:
allocate private object
-> acquire MMIO/clock/regulator/reset/GPIO
-> request IRQ / DMA / buffers / workqueue
-> initialize hardware
-> register subsystem object
-> expose user-space entry
Failure paths and remove must close this chain in reverse, while handling concurrency.
The Successful probe Path Looks Too Simple
A successful probe path often looks linear:
priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
base = devm_ioremap_resource(dev, res);
irq = platform_get_irq(pdev, 0);
ret = devm_request_irq(dev, irq, handler, 0, name, priv);
ret = register_device(priv);
That path is easy to test.
The real problem is that every step can fail:
- memory allocation
- missing register resources
- clock/regulator provider not ready
- wrong IRQ
- DMA mask setup
- hardware reset timeout
- subsystem registration
- device node creation
If step seven fails, the previous six successful steps may need to be undone. It must be clear who undoes them and in what order.
devm_ Reduces Rollback, but Does Not Design Lifetimes
devm_ APIs bind resources to struct device. On probe failure or device unbind, devres releases them automatically, in reverse order of acquisition.
Useful examples include:
- devm_kzalloc
- devm_ioremap_resource
- devm_clk_get
- devm_regulator_get
- devm_request_irq
- devm_gpiod_get
But devm_ does not mean remove bugs disappear.
Important boundaries:
- devm_ releases resources, but may not stop the hardware state machine
- automatic release order may not match the shutdown order the driver needs
- workqueues, timers, threads, and DMA may still be using resources
- user-space fd lifetime may outlive parts of device removal
- objects registered to other subsystems are not always devm-managed
- some resources must be disabled before being released
For example, an IRQ requested with devm_request_irq is freed only after remove returns, so if the hardware interrupt source is not disabled first, the handler can still run during remove. Memory from devm_kzalloc is freed automatically as well, but delayed work may still reference it.
devm_ registers cleanup. It does not replace driver shutdown design.
Error Paths Must Roll Back What Already Succeeded
A robust model is: every successful step has a known reverse action.
clock enabled -> disable it on failure
regulator enabled -> disable it on failure
IRQ requested -> free it or rely on devm after stopping hardware
workqueue started -> cancel/sync it
DMA started -> stop engine / unmap / free
subsystem registered -> unregister it
Common error-path bugs include:
- jumping to the wrong label and releasing uninitialized resources
- forgetting to disable clocks or regulators
- unregistering before stopping hardware
- freeing memory before cancelling work
- unmapping DMA buffers while hardware is still active
- mixing probe deferral with permanent failure handling
goto is not the problem. Unclear phase boundaries are.
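The reverse-action model above can be sketched as a goto ladder in which each label undoes exactly one earlier step. This is a user-space model rather than kernel code: the step names, struct model_dev, and the acquire/release helpers are all hypothetical stand-ins for real clock, IRQ, DMA, and registration calls.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical resources, acquired in this order during probe. */
enum step { STEP_CLK, STEP_IRQ, STEP_DMA, STEP_REGISTER, STEP_COUNT };

struct model_dev {
	int held[STEP_COUNT];     /* 1 while the resource is held */
	int undo_log[STEP_COUNT]; /* steps, in the order they were undone */
	int undo_count;
	int fail_at;              /* step that should fail, or -1 for none */
};

static int acquire(struct model_dev *d, enum step s)
{
	if ((int)s == d->fail_at)
		return -EIO;
	d->held[s] = 1;
	return 0;
}

static void release(struct model_dev *d, enum step s)
{
	assert(d->held[s]); /* releasing an unacquired resource is a bug */
	d->held[s] = 0;
	d->undo_log[d->undo_count++] = s;
}

static int model_probe(struct model_dev *d)
{
	int ret;

	ret = acquire(d, STEP_CLK);
	if (ret)
		return ret; /* nothing to undo yet */

	ret = acquire(d, STEP_IRQ);
	if (ret)
		goto err_clk;

	ret = acquire(d, STEP_DMA);
	if (ret)
		goto err_irq;

	ret = acquire(d, STEP_REGISTER);
	if (ret)
		goto err_dma;

	return 0;

	/* Each label undoes one step; falling through undoes the earlier ones. */
err_dma:
	release(d, STEP_DMA);
err_irq:
	release(d, STEP_IRQ);
err_clk:
	release(d, STEP_CLK);
	return ret;
}
```

Naming each label after the resource it releases keeps the phase boundaries visible: a failure at the DMA step jumps to err_irq, which undoes the IRQ and then the clock, and never touches a resource that was not acquired.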
-EPROBE_DEFER Is Not Ordinary Failure
When probe requests clocks, regulators, GPIOs, resets, or pinctrl states, a provider may not be ready yet. The driver can receive -EPROBE_DEFER.
That means the dependency is not ready and the kernel will retry probe later.
It differs from permanent failure:
- do not log it as a permanent error repeatedly
- do not leave hardware half-initialized
- do not start background work or user-space interfaces
- do not mark the device unrecoverable
If resources were already enabled before deferral, they still need rollback. Otherwise the next probe attempt sees hardware polluted by the previous partial attempt.
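A minimal user-space sketch of that distinction follows. It assumes a hypothetical probe that enables a regulator and then asks for a clock whose provider may not be ready. EPROBE_DEFER is kernel-internal, so the model defines it locally with the kernel's value; struct defer_model and its fields are invented for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal errno (value 517); not available in user space. */
#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517
#endif

struct defer_model {
	int clk_ready;     /* is the hypothetical clock provider ready? */
	int regulator_on;  /* enabled by the current probe attempt */
	int errors_logged; /* number of loud, permanent-error messages */
	int bound;         /* 1 once probe has fully succeeded */
};

static int get_clock(const struct defer_model *m)
{
	return m->clk_ready ? 0 : -EPROBE_DEFER;
}

static int defer_probe(struct defer_model *m)
{
	int ret;

	m->regulator_on = 1; /* first step succeeds */

	ret = get_clock(m);
	if (ret) {
		if (ret != -EPROBE_DEFER)
			m->errors_logged++; /* only permanent errors are loud */
		m->regulator_on = 0;    /* roll back in both cases */
		return ret;
	}

	m->bound = 1;
	return 0;
}
```

The rollback is identical for deferral and permanent failure; only the logging differs. In kernel code, dev_err_probe() is commonly used for that logging decision, since it reports -EPROBE_DEFER at debug level and other errors at error level.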
remove Is Not Just Reverse probe
During remove, the driver must stop the outside world from accessing the device.
A typical order is closer to:
block new requests
-> unregister user-space/subsystem entry
-> stop hardware from producing new events
-> disable IRQ / cancel work / stop timer
-> wait for running paths to exit
-> stop DMA
-> disable clocks/regulators/reset
-> release resources
The common missing piece is “running paths”:
- IRQ handler is active
- threaded IRQ is running
- workqueue has not completed
- timer is about to fire
- user-space read is sleeping on a wait queue
- mmap remains mapped
- DMA completion callback has not returned
If remove frees resources without stopping these paths first, use-after-free, NULL dereference, or hardware access faults follow.
User-Space fd Can Outlive Device Removal
Char devices and many subsystem interfaces have a real issue: an application can keep an fd open while the device is removed.
The driver must define:
- whether new open is rejected
- what existing read/write/ioctl return
- whether poll wakes and returns an error
- whether blocking read wakes up
- how mmap access is handled
- whether the object still exists at close
Do not assume all user-space users exit before remove.
A common pattern is to keep a present, dying, or state flag. remove marks the device unavailable, wakes wait queues, blocks new requests, and either waits for references to drain or makes later operations return -ENODEV, -EIO, or another clear error.
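The flag-plus-reference pattern can be sketched in user space as below. This is a single-threaded model under stated assumptions: struct fd_model_dev, the model_* functions, and the freed hook are hypothetical, and a real driver would typically use a kref or similar counter together with locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct fd_model_dev {
	int present; /* cleared by remove */
	int refs;    /* one for the driver binding, one per open fd */
	int *freed;  /* hypothetical hook so a caller can observe the free */
};

static void put_dev(struct fd_model_dev *d)
{
	if (--d->refs == 0) {
		if (d->freed)
			*d->freed = 1;
		free(d);
	}
}

static struct fd_model_dev *model_create(int *freed)
{
	struct fd_model_dev *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->present = 1;
	d->refs = 1; /* reference held by the driver binding */
	d->freed = freed;
	return d;
}

static int model_open(struct fd_model_dev *d)
{
	if (!d->present)
		return -ENODEV; /* reject new opens after remove */
	d->refs++;
	return 0;
}

static int model_read(struct fd_model_dev *d)
{
	return d->present ? 0 : -ENODEV; /* existing fd gets a clear error */
}

static void model_release(struct fd_model_dev *d) /* last close of an fd */
{
	put_dev(d);
}

static void model_remove(struct fd_model_dev *d)
{
	d->present = 0; /* a real driver would also wake wait queues here */
	put_dev(d);     /* drop the binding's reference */
}
```

With this split, remove can return while an fd is still open: the object survives until the last close, but every operation after remove fails with -ENODEV instead of touching hardware.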
Stop IRQ, workqueue, and timer Before Freeing
Asynchronous execution is where remove and error paths break most often.
Check:
- can IRQ handler run during remove
- does threaded IRQ need synchronization
- is delayed work queued
- can a timer re-arm itself
- does work access MMIO, DMA buffers, or private objects
Common shutdown tools include:
disable_irq(irq);
cancel_work_sync(&work);
cancel_delayed_work_sync(&dwork);
del_timer_sync(&timer);    /* named timer_delete_sync() on recent kernels */
The exact order depends on the driver. The rule is: before freeing anything an async path can use, ensure that path cannot run again.
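The self-re-arming timer case is a small example of why order matters. The sketch below is a single-threaded user-space model, not kernel code: struct timer_model and its functions are hypothetical, and calling timer_callback() after a cancel stands in for a callback that was already running on another CPU.

```c
#include <assert.h>

struct timer_model {
	int pending;  /* timer is armed */
	int stopping; /* set once shutdown has begun */
};

/* Body of the timer callback: does its work, then re-arms itself. */
static void timer_callback(struct timer_model *t)
{
	if (!t->stopping)
		t->pending = 1;
}

/* Cancels a pending timer but does nothing about re-arming. */
static void timer_cancel(struct timer_model *t)
{
	t->pending = 0;
}

/* Forbid re-arming first, then cancel. */
static void timer_shutdown(struct timer_model *t)
{
	t->stopping = 1;
	timer_cancel(t);
}
```

If only timer_cancel() is called, a callback that is still in flight arms the timer again and it fires after the object is freed. Setting the stop flag before cancelling closes that window in this model; a real driver also needs the synchronous cancel so that a running callback has finished before memory is released.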
DMA Cleanup Must Stop the Device First
DMA cleanup is dangerous because hardware may continue accessing memory after the CPU has freed it.
A safe order is usually:
stop submitting new DMA
-> wait for or abort in-flight DMA
-> confirm hardware will not touch the buffer
-> unmap DMA
-> free buffer
If memory is freed or unmapped before DMA is stopped, the device may write to memory returned to the system. This often appears as random memory corruption rather than a clear driver bug.
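The ordering can be made explicit by having each teardown step refuse to run until the previous one has completed. The following user-space model is only a sketch: struct dma_model and the dma_* helpers are hypothetical and do not correspond to the kernel DMA API.

```c
#include <assert.h>
#include <errno.h>

struct dma_model {
	int accepting; /* new transfers may be submitted */
	int in_flight; /* transfers the hardware still owns */
	int mapped;    /* buffer is mapped for the device */
	int allocated; /* buffer memory exists */
};

static int dma_submit(struct dma_model *m)
{
	if (!m->accepting)
		return -ENODEV;
	m->in_flight++;
	return 0;
}

static void dma_stop_submitting(struct dma_model *m)
{
	m->accepting = 0;
}

/* Stand-in for waiting on, or aborting, the hardware engine. */
static void dma_abort_in_flight(struct dma_model *m)
{
	m->in_flight = 0;
}

static int dma_unmap(struct dma_model *m)
{
	if (m->accepting || m->in_flight)
		return -EBUSY; /* hardware may still touch the buffer */
	m->mapped = 0;
	return 0;
}

static int dma_free(struct dma_model *m)
{
	if (m->mapped)
		return -EBUSY; /* never free a buffer that is still mapped */
	m->allocated = 0;
	return 0;
}

static int dma_teardown(struct dma_model *m)
{
	int ret;

	dma_stop_submitting(m);
	dma_abort_in_flight(m);
	ret = dma_unmap(m);
	if (ret)
		return ret;
	return dma_free(m);
}
```

Real hardware will not return -EBUSY when the order is wrong; it will simply corrupt memory. Encoding the preconditions as checks or assertions in debug builds is one way to catch an out-of-order teardown before it reaches the field.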
DMA completion, IRQ, workqueue, close, and remove often overlap. State flags, locks, and a clear stop order are essential.
runtime PM Makes remove Harder
Runtime PM lets an idle device suspend and resume automatically. It also makes error and remove paths trickier.
Consider:
- is the device runtime suspended or active
- must the driver resume before touching registers
- are clocks/regulators already disabled by PM
- are pm_runtime_get calls balanced
- can workqueue trigger resume
- can user-space access race with runtime suspend
If remove reads registers while the device is runtime suspended, it may fault. If remove does not block new requests first, user space may trigger runtime resume while resources are being released.
Runtime PM is not something to add at the end. It must be designed together with probe, remove, open/close, IRQ, and workqueue lifetimes.
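A small user-space model can illustrate two of these points: balanced get/put calls, and resuming before remove touches registers. Everything here is hypothetical (struct pm_model, pm_get, pm_put, reg_read, user_ioctl); pm_get and pm_put only stand in for the kernel's pm_runtime_get/put family and do not reproduce its semantics.

```c
#include <assert.h>
#include <errno.h>

struct pm_model {
	int usage;    /* outstanding get calls */
	int active;   /* 1 when powered, 0 when runtime suspended */
	int removing; /* set by remove to reject new users */
};

static void pm_get(struct pm_model *p)
{
	p->usage++;
	p->active = 1; /* resume on first use */
}

static void pm_put(struct pm_model *p)
{
	assert(p->usage > 0); /* an unbalanced put is a bug */
	if (--p->usage == 0)
		p->active = 0; /* idle: the device may suspend */
}

static int reg_read(const struct pm_model *p)
{
	return p->active ? 0 : -EIO; /* a suspended device cannot be read */
}

static int user_ioctl(struct pm_model *p)
{
	int ret;

	if (p->removing)
		return -ENODEV;
	pm_get(p);
	ret = reg_read(p);
	pm_put(p);
	return ret;
}

static int pm_model_remove(struct pm_model *p)
{
	int ret;

	p->removing = 1;   /* block new requests first */
	pm_get(p);         /* resume before touching registers */
	ret = reg_read(p); /* stand-in for quiescing the hardware */
	pm_put(p);
	return ret;
}
```

In the model, reading a register without a preceding get fails, every get is paired with a put, and user requests arriving after remove has started are rejected before they can trigger a resume.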
Debug the First Failure and the Final Cleanup
Probe failure logs often show only the final error. The actual cause may be an earlier resource acquisition or state transition.
Inspect this chain:
where probe started
-> which resource failed first
-> what was already enabled or registered
-> whether the return was -EPROBE_DEFER
-> whether the error path disabled enabled resources
-> whether remove blocks new requests
-> whether IRQ/work/timer/DMA are stopped synchronously
-> whether user-space fd can still access the object
-> whether runtime PM state is consistent
Logs should show both the first failure and the cleanup path. A single “probe failed” line is not enough.
Successful probe Is Not the End of the Lifecycle
The Linux driver lifecycle does not end when probe succeeds. Probe only brings the device into a usable state. After that come user-space access, interrupts, DMA, workqueues, runtime PM, suspend/resume, remove, and error rollback.
The successful path is easiest to write and easiest to test. Driver stability is decided when initialization fails halfway, removal happens halfway, suspend happens halfway, or user space is still holding references.
When writing each probe step, write down its reverse action. That is the start of avoiding lifecycle bugs.