Why Linux Driver probe Failure Paths Are More Bug-Prone Than Successful probe


Driver debugging often focuses on the successful probe path: acquire resources, map registers, request the interrupt, initialize hardware, register the user-space interface, then print “probe ok”.

Field bugs often hide on another path:

probe fails halfway and does not roll back cleanly
probe returns an error while IRQ or workqueue is still active
user space keeps an fd after remove
runtime suspend powers the device down while the error path is releasing resources
DMA buffer is freed while the device is still writing
devm_ is used, but object lifetime is not what the driver expected

The hard part is not initializing hardware once. The hard part is that whenever a step fails, the device is removed, the module unloads, the system suspends, or user space still holds an fd, the driver must stop everything it has already started, in the right order.

Think of probe as progressively opening resources:

allocate private object
-> acquire MMIO/clock/regulator/reset/GPIO
-> request IRQ / DMA / buffers / workqueue
-> initialize hardware
-> register subsystem object
-> expose user-space entry

Failure paths and remove must close this chain in reverse, while handling concurrency.

The Successful probe Path Looks Too Simple

A successful probe path often looks linear:

/* Happy path only: no return value is checked here. */
priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
base = devm_ioremap_resource(dev, res);
irq = platform_get_irq(pdev, 0);
ret = devm_request_irq(dev, irq, handler, 0, name, priv);
ret = register_device(priv);    /* placeholder for the subsystem registration call */

That path is easy to test.

The real problem is that every step can fail:

memory allocation
missing register resources
clock/regulator provider not ready
wrong IRQ
DMA mask setup
hardware reset timeout
subsystem registration
device node creation

If step seven fails, the previous six successful steps may need to be undone. Who undoes them, and in what order, must be clear.

`devm_` Reduces Rollback, but Does Not Design Lifetimes

devm_ APIs bind resources to a struct device. On probe failure or device unbind, devres releases them automatically, in reverse order of acquisition.

Useful examples include:

devm_kzalloc
devm_ioremap_resource
devm_clk_get
devm_regulator_get
devm_request_irq
devm_gpiod_get

But devm_ does not mean remove bugs disappear.

Important boundaries:

devm_ releases resources, but may not stop the hardware state machine
the automatic release order (reverse of acquisition) may not match the shutdown order the hardware needs
workqueues, timers, threads, and DMA may still be using resources
user-space fd lifetime may outlive parts of device removal
objects registered to other subsystems are not always devm-managed
some resources must be disabled before being released

For example, devm_request_irq frees the IRQ automatically, but only after remove has returned. If the hardware interrupt source is not disabled first, the handler can still run during remove. Likewise, devm_kzalloc frees the memory, but delayed work may still reference it.

devm_ registers cleanup. It does not replace driver shutdown design.
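The ordering point can be illustrated with a small userspace model. This is only a sketch, not kernel code: `add_action`, `release_all`, and the action names are hypothetical stand-ins for a devres list that runs its actions last-in, first-out. In a real driver, a custom step such as "quiesce the hardware" can be placed into that list with `devm_add_action_or_reset()`.

```c
#include <string.h>

#define MAX_ACTIONS 8

/* Minimal stand-in for a devres list: actions run last-in, first-out. */
struct action { void (*fn)(void *); void *data; };
static struct action stack[MAX_ACTIONS];
static int depth;
static char order[64];           /* records the order in which actions ran */

static int add_action(void (*fn)(void *), void *data)
{
	if (depth == MAX_ACTIONS)
		return -1;
	stack[depth].fn = fn;
	stack[depth].data = data;
	depth++;
	return 0;
}

/* Called on probe failure or unbind in this model. */
void release_all(void)
{
	while (depth > 0) {
		depth--;
		stack[depth].fn(stack[depth].data);
	}
}

static void note(void *data) { strcat(order, (const char *)data); }

/* Hypothetical probe: each acquisition registers its own undo action. */
int model_probe(void)
{
	order[0] = '\0';
	depth = 0;
	if (add_action(note, "free-mem "))
		return -1;
	if (add_action(note, "free-irq "))
		return -1;
	/* Registered last, so it runs first: mask the interrupt source. */
	if (add_action(note, "quiesce-hw "))
		return -1;
	return 0;
}

const char *release_order(void) { return order; }
```

Because the quiesce action is registered after the IRQ, it runs before the IRQ is freed. If it were registered first, the IRQ and memory would be released while the hardware could still raise interrupts.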

Error Paths Must Roll Back What Already Succeeded

A robust model is: every successful step has a known reverse action.

clock enabled -> disable it on failure
regulator enabled -> disable it on failure
IRQ requested -> free it or rely on devm after stopping hardware
workqueue started -> cancel/sync it
DMA started -> stop engine / unmap / free
subsystem registered -> unregister it

Common error-path bugs include:

jumping to the wrong label and releasing uninitialized resources
forgetting to disable clocks or regulators
unregistering before stopping hardware
freeing memory before cancelling work
unmapping DMA buffers while hardware is still active
mixing probe deferral with permanent failure handling

goto is not the problem. Unclear phase boundaries are.
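A minimal userspace sketch of such phase boundaries follows. It is a model, not a driver: the step names, `do_step`, `undo`, and the failure injection are all hypothetical. Each label undoes exactly one earlier step, and a failure jumps to the label for the last step that succeeded.

```c
#include <string.h>

/* Userspace model of a probe() unwind ladder. All names are hypothetical. */
enum { STEP_CLK, STEP_REGULATOR, STEP_IRQ, STEP_REGISTER };

static char trace[128];          /* records the order of actions */
static int fail_at = -1;         /* step index forced to fail, or -1 */

static int do_step(int step, const char *name)
{
	if (step == fail_at)
		return -1;               /* stands in for a negative errno */
	strcat(trace, name);
	return 0;
}

static void undo(const char *name)
{
	strcat(trace, name);
}

int model_probe(int fail_step)
{
	int ret;

	trace[0] = '\0';
	fail_at = fail_step;

	ret = do_step(STEP_CLK, "+clk ");
	if (ret)
		return ret;              /* nothing to undo yet */
	ret = do_step(STEP_REGULATOR, "+reg ");
	if (ret)
		goto err_clk;
	ret = do_step(STEP_IRQ, "+irq ");
	if (ret)
		goto err_reg;
	ret = do_step(STEP_REGISTER, "+dev ");
	if (ret)
		goto err_irq;
	return 0;

	/* Labels fall through, so undo runs in reverse order of setup. */
err_irq:
	undo("-irq ");
err_reg:
	undo("-reg ");
err_clk:
	undo("-clk ");
	return ret;
}

const char *model_trace(void) { return trace; }
```

Forcing each step to fail in turn and checking the trace is a cheap way to verify that no label releases something that was never acquired.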

`-EPROBE_DEFER` Is Not Ordinary Failure

When probe requests clocks, regulators, GPIOs, resets, or pinctrl states, a provider may not be ready yet. The driver can receive -EPROBE_DEFER.

That means the dependency is not ready and the kernel will retry probe later.

It differs from permanent failure:

do not log it as a permanent error repeatedly
do not leave hardware half-initialized
do not start background work or user-space interfaces
do not mark the device unrecoverable

If resources were already enabled before deferral, they still need rollback. Otherwise the next probe attempt sees hardware polluted by the previous partial attempt.
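The two rules, stay quiet on deferral but still roll back, can be sketched in a userspace model. The names below are hypothetical, and `MODEL_EPROBE_DEFER` simply mirrors the kernel's value of 517. `report_probe_error` imitates what the kernel helper `dev_err_probe()` does: it returns the error unchanged and does not log deferral at error level.

```c
#include <errno.h>
#include <stdio.h>

#define MODEL_EPROBE_DEFER 517   /* value of the kernel's EPROBE_DEFER */

static int error_logs;           /* lines logged at error level */
static int clk_enabled;

/* Stand-in for dev_err_probe(): stays quiet for deferral, returns err. */
static int report_probe_error(int err, const char *what)
{
	if (err != -MODEL_EPROBE_DEFER) {
		fprintf(stderr, "probe: %s failed: %d\n", what, err);
		error_logs++;
	}
	return err;
}

/* Hypothetical probe: the clock is enabled, then the regulator lookup
 * returns regulator_err (0 on success). */
int model_probe(int regulator_err)
{
	clk_enabled = 1;
	if (regulator_err) {
		clk_enabled = 0;         /* roll back, deferred or not */
		return report_probe_error(regulator_err, "regulator");
	}
	return 0;
}

int model_error_logs(void) { return error_logs; }
int model_clk_enabled(void) { return clk_enabled; }
```

The rollback does not depend on the error code. Only the logging does.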

remove Is Not Just Reverse probe

During remove, the driver must stop the outside world from accessing the device.

A typical order is closer to:

block new requests
-> unregister user-space/subsystem entry
-> stop hardware from producing new events
-> disable IRQ / cancel work / stop timer
-> wait for running paths to exit
-> stop DMA
-> disable clocks/regulators/reset
-> release resources

The common missing piece is “running paths”:

IRQ handler is active
threaded IRQ is running
workqueue has not completed
timer is about to fire
user-space read is sleeping on a wait queue
mmap remains mapped
DMA completion callback has not returned

If remove frees resources without stopping these paths first, use-after-free, NULL dereference, or hardware access faults follow.

User-Space fd Can Outlive Device Removal

Char devices and many subsystem interfaces share a real problem: an application can keep an fd open while the device is removed.

The driver must define:

whether new open is rejected
what existing read/write/ioctl return
whether poll wakes and returns an error
whether blocking read wakes up
how mmap access is handled
whether the object still exists at close

Do not assume all user-space users exit before remove.

A common pattern is to keep a flag such as present or dying, or a state field. remove marks the device unavailable, wakes wait queues, blocks new requests, and either waits for references to drain or makes later operations return -ENODEV, -EIO, or another clear error.
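One way to picture this pattern is a reference-counted object with a dead flag. The sketch below is a single-threaded userspace model with hypothetical names. A real driver would need locking around the flag and would typically use `kref` for the count.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-device object shared by the driver and open files. */
struct model_dev {
	int refs;        /* one for the bound driver, one per open fd */
	bool dead;       /* set by remove; checked by every file operation */
};

static int live_objects;   /* lets a test observe the final free */

struct model_dev *model_probe(void)
{
	struct model_dev *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->refs = 1;
	live_objects++;
	return d;
}

static void model_put(struct model_dev *d)
{
	if (--d->refs == 0) {
		free(d);
		live_objects--;
	}
}

int model_open(struct model_dev *d)
{
	if (d->dead)
		return -ENODEV;      /* reject new opens after remove */
	d->refs++;
	return 0;
}

int model_read(struct model_dev *d)
{
	return d->dead ? -ENODEV : 0;   /* never touch hardware once dead */
}

void model_release(struct model_dev *d) { model_put(d); }

void model_remove(struct model_dev *d)
{
	d->dead = true;      /* a real driver would also wake sleeping readers */
	model_put(d);        /* drops the driver's reference, not the object */
}

int model_live_objects(void) { return live_objects; }
```

The object is freed by whoever drops the last reference. If an fd is still open at remove time, that is the final close, not remove.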

Stop IRQ, workqueue, and timer Before Freeing

Asynchronous execution is where remove and error paths break most often.

Check:

can IRQ handler run during remove
does threaded IRQ need synchronization
is delayed work queued
can a timer re-arm itself
does work access MMIO, DMA buffers, or private objects

Common shutdown tools include:

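disable_irq(irq);                    /* also waits for a running handler to finish */
cancel_work_sync(&work);             /* waits for a running work item to finish */
cancel_delayed_work_sync(&dwork);
del_timer_sync(&timer);              /* named timer_delete_sync() on newer kernels */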

The exact order depends on the driver. The rule is: before freeing anything an async path can use, ensure that path cannot run again.
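The self re-arming case deserves a closer look, because cancelling once is not enough if the callback can queue itself again. The following is a single-threaded userspace model with hypothetical names. It shows only the flag logic and none of the synchronization a real driver needs. Recent kernels offer `timer_shutdown_sync()` for a similar purpose.

```c
#include <stdbool.h>

/* Hypothetical model of a self re-arming timer and its shutdown flag. */
struct model_timer {
	bool pending;        /* queued to fire */
	bool shutting_down;  /* set before the final cancel */
	int fired;           /* how many times the callback ran */
};

static void model_arm(struct model_timer *t)
{
	if (!t->shutting_down)   /* re-arming is refused during teardown */
		t->pending = true;
}

/* The callback re-arms itself, as a polling timer typically does. */
void model_fire(struct model_timer *t)
{
	if (!t->pending)
		return;
	t->pending = false;
	t->fired++;
	model_arm(t);
}

void model_start(struct model_timer *t) { model_arm(t); }

/* Teardown: forbid re-arming first, then cancel what is queued. */
void model_stop(struct model_timer *t)
{
	t->shutting_down = true;
	t->pending = false;
}
```

Setting the flag before cancelling matters. With the order reversed, a callback that runs between the two steps can re-arm the timer after the cancel.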

DMA Cleanup Must Stop the Device First

DMA cleanup is dangerous because hardware may continue accessing memory after the CPU has freed it.

A safe order is usually:

stop submitting new DMA
-> wait for or abort in-flight DMA
-> confirm hardware will not touch the buffer
-> unmap DMA
-> free buffer

If memory is freed or unmapped before DMA is stopped, the device may write to memory returned to the system. This often appears as random memory corruption rather than a clear driver bug.

DMA completion, IRQ, workqueue, close, and remove often overlap. State flags, locks, and a clear stop order are essential.
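The stop order can be made explicit as a small state model. This is a userspace sketch with hypothetical names, not a use of the kernel DMA API. Its only point is that unmapping is refused while any transfer is in flight, and that teardown drains before it unmaps.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical DMA channel state for a userspace model. */
struct model_dma {
	bool accepting;   /* new submissions allowed */
	int in_flight;    /* transfers the device may still write to */
	bool mapped;      /* buffer is mapped for the device */
};

int model_submit(struct model_dma *c)
{
	if (!c->accepting || !c->mapped)
		return -ENODEV;
	c->in_flight++;
	return 0;
}

void model_complete(struct model_dma *c)
{
	if (c->in_flight > 0)
		c->in_flight--;
}

/* Unmapping is refused while the device may still touch the buffer. */
int model_unmap(struct model_dma *c)
{
	if (c->in_flight > 0)
		return -EBUSY;
	c->mapped = false;
	return 0;
}

/* Teardown in the safe order: stop intake, drain, then unmap. */
int model_teardown(struct model_dma *c)
{
	c->accepting = false;
	while (c->in_flight > 0)
		model_complete(c);   /* stands in for abort-and-wait */
	return model_unmap(c);
}
```

In a real driver the drain step is the hard part: it has to wait for, or abort, hardware activity rather than decrement a counter.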

Runtime PM Makes remove Harder

Runtime PM lets an idle device suspend and resume automatically. It also makes error and remove paths trickier.

Consider:

is the device runtime suspended or active
must the driver resume before touching registers
are clocks/regulators already disabled by PM
are pm_runtime_get calls balanced
can workqueue trigger resume
can user-space access race with runtime suspend

If remove reads registers while the device is runtime suspended, it may fault. If remove does not block new requests first, user space may trigger runtime resume while resources are being released.

Runtime PM is not something to add at the end. It must be designed together with the probe, remove, open/close, IRQ, and workqueue lifetimes.
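The balance requirement can be sketched with a usage counter. This is a userspace model with hypothetical names. In a real driver the corresponding calls would be along the lines of `pm_runtime_get_sync()` and `pm_runtime_put()`, and resume can fail, which this model ignores.

```c
#include <stdbool.h>

/* Hypothetical model of a runtime PM usage counter. */
struct model_pm {
	int usage;        /* outstanding get calls */
	bool powered;     /* registers are accessible */
};

/* Resume the device if needed and hold it active. */
void model_pm_get(struct model_pm *pm)
{
	pm->usage++;
	pm->powered = true;
}

/* Drop one reference; the device may suspend when none remain. */
void model_pm_put(struct model_pm *pm)
{
	if (pm->usage > 0 && --pm->usage == 0)
		pm->powered = false;
}

/* Register access is only valid while powered; returns 0 on success. */
int model_reg_write(const struct model_pm *pm)
{
	return pm->powered ? 0 : -1;
}

/* Remove path: hold the device active around the final register writes. */
int model_remove(struct model_pm *pm)
{
	int ret;

	model_pm_get(pm);
	ret = model_reg_write(pm);   /* e.g. mask interrupts, stop the engine */
	model_pm_put(pm);
	return ret;
}
```

Every get in the model has exactly one put on every path, including the error return. An unbalanced counter leaves the device either permanently powered or suspended while code still expects register access.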

Debug the First Failure and the Final Cleanup

Probe failure logs often show only the final error. The actual cause may be an earlier resource acquisition or state transition.

Inspect this chain:

where probe started
-> which resource failed first
-> what was already enabled or registered
-> whether the return was -EPROBE_DEFER
-> whether the error path disabled enabled resources
-> whether remove blocks new requests
-> whether IRQ/work/timer/DMA are stopped synchronously
-> whether user-space fd can still access the object
-> whether runtime PM state is consistent

Logs should show both the first failure and the cleanup path. A single “probe failed” line is not enough.

Successful probe Is Not the End of the Lifecycle

The Linux driver lifecycle does not end when probe succeeds. Probe only brings the device into a usable state. After that come user-space access, interrupts, DMA, workqueues, runtime PM, suspend/resume, remove, and error rollback.

The successful path is easiest to write and easiest to test. Driver stability is decided when initialization fails halfway, removal happens halfway, suspend happens halfway, or user space is still holding references.

When writing each probe step, write down its reverse action. That is the start of avoiding lifecycle bugs.