Linux

Why Linux Driver probe Failure Paths Are More Bug-Prone Than the Successful probe Path

7 minutes

Driver debugging often focuses on the successful probe path: acquire resources, map registers, request the interrupt, initialize the hardware, register the user-space interface, then print “probe ok”.

Field bugs often hide on the other paths:

probe fails halfway and does not roll back cleanly
probe returns an error while an IRQ or workqueue is still active
user space keeps an fd open after remove
runtime suspend powers the device down while the error path is releasing resources
a DMA buffer is freed while the device is still writing to it
devm_ helpers are used, but the object lifetime is not what the driver expected

The hard part is not initializing hardware once. The hard part is this: whether a step fails, the device is removed, the module unloads, the system suspends, or user space still holds an fd, the driver must stop everything it has already started, in the right order.
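The rollback requirement can be sketched as a small user-space C model rather than real kernel code. The step names, fake_probe(), and probe_trace() are hypothetical and exist only for illustration; the point is the goto-based unwind, where each failure jumps to a label that undoes only the steps that already succeeded, in reverse order.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical setup steps, in the order a probe would run them. */
enum step { STEP_RES, STEP_MAP, STEP_IRQ, STEP_HW, STEP_REG, STEP_COUNT };

static const char *const step_name[STEP_COUNT] = {
    "res", "map", "irq", "hw", "reg"
};

/* Records every setup (+) and teardown (-) so the order can be inspected. */
static char trace[128];

static void record(char sign, enum step s)
{
    size_t used = strlen(trace);

    snprintf(trace + used, sizeof(trace) - used, "%c%s ", sign, step_name[s]);
}

/* Simulated setup step: fails with -1 when s matches fail_at. */
static int do_step(enum step s, int fail_at)
{
    if ((int)s == fail_at)
        return -1;
    record('+', s);
    return 0;
}

static void undo_step(enum step s)
{
    record('-', s);
}

/* Model of probe: pass the step to fail at, or -1 for full success. */
int fake_probe(int fail_at)
{
    int ret;

    trace[0] = '\0';

    ret = do_step(STEP_RES, fail_at);
    if (ret)
        return ret;             /* nothing started yet, nothing to undo */
    ret = do_step(STEP_MAP, fail_at);
    if (ret)
        goto err_res;
    ret = do_step(STEP_IRQ, fail_at);
    if (ret)
        goto err_map;
    ret = do_step(STEP_HW, fail_at);
    if (ret)
        goto err_irq;
    ret = do_step(STEP_REG, fail_at);
    if (ret)
        goto err_hw;
    return 0;

    /* Labels fall through, so teardown runs in reverse setup order. */
err_hw:
    undo_step(STEP_HW);
err_irq:
    undo_step(STEP_IRQ);
err_map:
    undo_step(STEP_MAP);
err_res:
    undo_step(STEP_RES);
    return ret;
}

/* Returns the recorded setup/teardown sequence of the last fake_probe(). */
const char *probe_trace(void)
{
    return trace;
}
```

Calling fake_probe(STEP_IRQ) leaves the trace "+res +map -map -res ": the two steps that succeeded are undone, newest first, and nothing that never started is torn down. A real driver has the same shape, but each undo must also quiesce anything asynchronous (interrupts, work items, DMA) before freeing what it uses.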

Why Linux Driver Debugging Is More Than printk

6 minutes

The first tool many driver developers reach for is printk. Probe does not run, print a line. Interrupts do not arrive, print a line. DMA does not move, print a line. User space cannot read data, print another line.

That works, but only up to a point:

too many logs hide the real failure
high-frequency logs slow the system down
printing in interrupt or locked paths changes timing
production images cannot keep verbose logs enabled
an intermittent bug disappears after adding logs
multi-instance devices are hard to distinguish

A better model is layered instrumentation:

What Should sysfs, debugfs, and procfs Expose?

6 minutes

Linux drivers often expose information through text files in addition to /dev/xxx and ioctl:

/sys/...
/sys/kernel/debug/...
/proc/...

They all support cat and echo, so it is tempting to place state, configuration, debug knobs, and statistics wherever convenient. That convenience becomes interface debt: test scripts depend on debugfs, product applications parse procfs, sysfs formats cannot be changed, and field tools do not know which interface is stable.

A practical boundary is:

Why Kernel Configuration, Modules, and Boot Parameters Change Device Behavior

7 minutes

Some embedded Linux failures are easy to misread:

the same rootfs boots with one kernel but not another
a driver module exists in rootfs, but the device is unavailable during boot
no serial log appears, so the system looks dead
after changing bootloader variables, the system mounts the wrong partition
the application is unchanged, but device nodes appear in a different order

These problems are not always application or rootfs problems. Device behavior is strongly shaped by three things:

Why Embedded Linux Images Split boot, rootfs, and data

7 minutes

An embedded Linux device that can boot is not necessarily ready for product deployment.

During development, putting the bootloader, kernel, dtb, rootfs, application, and data together may work. The problems appear during updates, factory reset, abnormal power loss, partition damage, and field repair:

a rootfs update overwrites user data
the device tree changes but the bootloader still loads the old dtb
a full data partition breaks system services
rootfs is damaged and there is no recovery path
power loss during OTA leaves no bootable system
factory reset removes too much or too little

The point of partition layout is not simply “more partitions”. It separates the boot chain from data lifetimes:

Why Embedded Linux Often Uses a Read-Only rootfs

7 minutes

During development, many embedded Linux systems use a writable ext4 rootfs. It is convenient: copy missing files, edit configuration, and write logs under /var/log.

In a product, that convenience turns into risk:

power loss can corrupt system files
temporary files, logs, and databases get mixed into rootfs
updates cannot easily tell user changes from system files
factory reset has unclear boundaries
the system partition becomes dirty, making field issues hard to reproduce

The point of a read-only rootfs is not simply to prevent changes. It is to make boundaries explicit: system files should be verifiable and recoverable; runtime state should be disposable; user data should have a clear home; updates and factory reset should not guess which files matter.

What Problems Do Buildroot and Yocto Actually Solve?

7 minutes

Many embedded Linux projects start by assembling a system by hand: download a cross compiler, build the kernel, copy BusyBox, add libraries, package rootfs, and drop in the application.

That may boot a board, but it quickly breaks down in a product:

which cross toolchain built this image
which kernel config and device tree were used
where each library in rootfs came from
why the same application links differently on another machine
whether production images, debug images, and SDKs can be generated from one configuration
whether the same system can be rebuilt six months later

Buildroot and Yocto do not mainly solve “how to install a few packages”. They solve this: how to put the toolchain, bootloader, kernel, rootfs, packages, image, and SDK into a repeatable build system.

Why systemd and udev Affect Device Application Startup

7 minutes

On embedded Linux devices, applications often fail only during boot: a serial port has not appeared, the NIC has no address yet, the data partition is not mounted, the camera node is missing, or permissions have not been applied.

The common patch is:

ExecStartPre=/bin/sleep 5

It may appear to work, but it turns dependencies into a timing guess. If storage enumeration is slower, DHCP takes longer, or a driver probe is retried later after returning -EPROBE_DEFER, the failure comes back.

Why rootfs Decides Whether Linux Can Enter User Space

7 minutes

The serial log may show a long Linux kernel boot, then end in a panic. The kernel may appear to boot, but the product application never starts. The log may say “No working init found” or “VFS: Unable to mount root fs”.

These failures are often treated as application startup problems. But before any application can run, Linux has to cross a more basic boundary: the kernel must find and mount rootfs, then execute the first user-space process.
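The last step, executing the first user-space process, can be illustrated with a simplified user-space C model. It assumes the fallback list that mainline kernels try when nothing else selected an init (/sbin/init, /etc/init, /bin/init, /bin/sh) and deliberately ignores init=, rdinit=, an initramfs /init, and any configured default. pick_init() and the list of executables it receives are hypothetical names for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Fallback candidates, in the order the kernel tries them when neither
 * a boot parameter nor an initramfs selected something else.
 */
static const char *const init_candidates[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh"
};

/*
 * executables: hypothetical list of executable paths present in the
 * mounted rootfs (assumption: presence means it can be executed).
 * Returns the first candidate that is present, or NULL, which
 * corresponds to the "No working init found" panic.
 */
const char *pick_init(const char *const *executables, size_t count)
{
    size_t n = sizeof(init_candidates) / sizeof(init_candidates[0]);
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = 0; j < count; j++)
            if (strcmp(init_candidates[i], executables[j]) == 0)
                return init_candidates[i];
    return NULL;
}
```

A rootfs that contains only the product application makes this model return NULL, which is the situation behind “No working init found”. The real kernel actually attempts to execute each candidate, so a binary that exists but lacks its loader or shared libraries fails in the same way.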

How runtime PM Differs From suspend/resume in Linux Drivers

7 minutes

Low-power bugs often look intermittent: the first I/O after idle fails, interrupts disappear after wakeup, /dev still exists but hardware does not respond, or power never goes down.

These problems often come from the Linux driver power-management state machine.

Embedded Linux drivers commonly face two paths: runtime PM and system suspend/resume. Both save power, but they solve different problems.

The safest first model is this: runtime PM handles per-device idle power saving while the system is running; system suspend/resume handles whole-system sleep and wakeup. A driver must keep I/O, resources, wakeup sources, and state restore consistent in both paths.
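The two paths can be sketched as a toy state model in user-space C. This is not the kernel runtime PM API: struct dev_model and the model_* functions are hypothetical, and rejecting I/O during system sleep is a simplification. The sketch only shows how a per-device usage count (runtime PM) and a whole-system sleep flag (suspend/resume) have to agree about whether the hardware is powered.

```c
#include <stdbool.h>

/* Hypothetical user-space model of one device; not a kernel structure. */
struct dev_model {
    int usage;            /* active users, like a runtime PM usage count */
    bool powered;         /* is the hardware currently powered? */
    bool in_system_sleep; /* set between system suspend and resume */
    bool resume_powered;  /* whether system resume must power back up */
};

/* Runtime path: take a reference before I/O, powering up if needed. */
int model_get(struct dev_model *d)
{
    if (d->in_system_sleep)
        return -1;        /* no I/O while the whole system is asleep */
    d->usage++;
    d->powered = true;
    return 0;
}

/* Runtime path: drop the reference, power down once the device is idle. */
void model_put(struct dev_model *d)
{
    if (d->usage > 0 && --d->usage == 0)
        d->powered = false;
}

/* System path: power down regardless of users, remembering the state. */
void model_system_suspend(struct dev_model *d)
{
    d->resume_powered = d->powered;
    d->powered = false;
    d->in_system_sleep = true;
}

/* System path: restore power only if the device was active before sleep. */
void model_system_resume(struct dev_model *d)
{
    d->in_system_sleep = false;
    d->powered = d->resume_powered;
}
```

In this model, a device that was idle before system sleep stays powered off after resume, while a device with an outstanding user comes back powered with its usage count intact. The first I/O after idle failing, or power never going down, correspond to the two paths disagreeing about these fields.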