Why Filesystems Fear Sudden Power Loss

Many field failures end with the same sentence: the device lost power right after writing configuration, and after reboot the file was damaged.

The application clearly called write(), and it even returned success. The filesystem may not be completely broken, but a config file becomes empty, a log tail is garbage, a database rolls back, or an update package fails verification.

This is often misunderstood as “the filesystem is unreliable.” A more accurate view is that filesystems trade off performance, device lifetime, and consistency, and applications must decide which guarantee they actually need: a successful write() return, durability on storage, or a complete application-level update.

The safest first model is this: a file write passes through application buffering, the kernel page cache, filesystem metadata, block device queues, and the storage media. A successful write() means the data reached one of these layers, typically the kernel page cache. It does not mean the data is completely on nonvolatile storage, or that the update is complete from the application's point of view.

application write
-> C library buffering
-> kernel page cache
-> filesystem data block and metadata allocation
-> block device / flash driver
-> storage media actually completes write

When power fails, the result depends on where the write path stopped.

write() Success Does Not Mean Durable Storage

Many applications treat successful write() as “the data is on storage.” That is not accurate enough.

For performance, operating systems often put writes into the in-memory page cache first and write them to storage later in batches. This merges small writes, reduces random I/O, and improves throughput.

So write() success often means:

arguments were valid
data was copied from user space to the kernel
the kernel accepted the write
the in-memory file state was updated

It does not necessarily mean:

data reached the storage chip
file metadata was synced
directory entry was persisted
device internal write cache completed
the new content can be read after power loss

If an application needs data to survive power loss, it usually has to consider fsync() or fdatasync(), directory sync, atomic replacement, and how the storage device flushes its own cache.
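The layers above map onto explicit calls. The following is a minimal Python sketch, assuming a POSIX-like system; the function name durable_append and the file name are illustrative only, and whether the device's own write cache is flushed still depends on the OS and hardware.

```python
import os
import tempfile


def durable_append(path: str, data: bytes) -> None:
    """Append data and ask the kernel to push it to storage before returning."""
    with open(path, "ab") as f:
        f.write(data)         # data sits in the user-space buffer of the file object
        f.flush()             # buffer -> kernel page cache (issues the write() call)
        os.fsync(f.fileno())  # ask the kernel to write the file out to the device


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "events.log")
    durable_append(demo, b"boot ok\n")
    durable_append(demo, b"config saved\n")
    with open(demo, "rb") as f:
        print(f.read())
```

Stopping after f.write() or f.flush() corresponds to the "write() returned success" state described above. Only the fsync step asks for durability, and it covers the file itself, not the directory entry of a newly created file.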

File Data and Metadata Are Different

A file is not only content. It also has metadata.

Metadata includes:

file size
permissions
timestamps
data block locations
inode information
directory entries

File data and metadata may be written to storage at different times.

For example, creating and writing a new file involves:

create directory entry
allocate inode
allocate data blocks
write file content
update file size
update directory and inode metadata

Power can fail between any of these steps. Results may include:

content was written, but directory entry was not persisted, so the file disappears
directory entry exists, but file size is old
file size updated, but some data blocks were not written
old and new data are mixed

Power-loss consistency is not only “did the content get written.” The filesystem also needs data blocks, metadata, and directory structure to recover consistently.

Journaling Mainly Protects Metadata Consistency

Many general-purpose filesystems use journaling.

The basic idea is: before modifying filesystem structures, record the intended metadata updates in a journal. After power loss, the filesystem can replay or discard incomplete operations so the structure returns to a consistent state.

This prevents many severe problems:

inode points to unallocated blocks
free-space bitmap disagrees with actual use
directory structure is corrupted
filesystem requires long full-volume scan

But journaling usually prioritizes filesystem structural consistency. It does not necessarily guarantee application data semantics.

For example, if a configuration file is overwritten halfway, the filesystem may recover with a consistent structure, while the config content is still half-new, half-old, empty, or otherwise invalid.

So journaling is not an application-level transaction. It helps the filesystem recover; it does not automatically make your business update atomic.
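The replay-or-discard idea can be illustrated with a toy append-only log at the application level. This is only an analogy under stated assumptions, not how any real filesystem journal is implemented; the record layout (length, CRC32, payload) and the names append_record and replay are made up for the sketch.

```python
import os
import struct
import tempfile
import zlib

REC = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(path: str, payload: bytes) -> None:
    """Append one self-describing record and sync it."""
    with open(path, "ab") as f:
        f.write(REC.pack(len(payload), zlib.crc32(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())


def replay(path: str) -> list:
    """Return complete records in order; stop at the first torn or corrupt one."""
    with open(path, "rb") as f:
        blob = f.read()
    records = []
    pos = 0
    while pos + REC.size <= len(blob):
        length, crc = REC.unpack_from(blob, pos)
        start = pos + REC.size
        payload = blob[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # incomplete tail: discard it and everything after it
        records.append(payload)
        pos = start + length
    return records
```

A record that was only partly written before power loss fails the length or CRC check and is discarded, so recovery sees a prefix of complete operations. What those operations mean for the business state is still the application's job to define.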

Why Overwrite-in-Place Is Dangerous

Many applications save configuration by opening the original file and overwriting it:

open config
truncate to zero
write new content
close

If power fails after truncate and before the write completes, reboot may see an empty file or half a file.
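The window is easy to observe without cutting power. A small Python sketch, with a hypothetical helper name, that rewrites a file in place and reports the size seen between truncation and write:

```python
import os
import tempfile


def overwrite_in_place(path: str, text: str) -> int:
    """Rewrite path in place and return the file size seen right after opening."""
    with open(path, "w") as f:                  # mode "w" truncates to zero length
        size_in_window = os.path.getsize(path)  # old content is already gone here
        f.write(text)
    return size_in_window


if __name__ == "__main__":
    cfg = os.path.join(tempfile.mkdtemp(), "config")
    with open(cfg, "w") as f:
        f.write("mode=old\n")
    print(overwrite_in_place(cfg, "mode=new\n"))
```

This only shows the in-memory view of the file. What actually lands on storage after a power cut, and in which order, depends on the filesystem and its mount options, but the old content is no longer available as a fallback once the truncate has been issued.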

A safer pattern is usually: write a temporary file, then atomically replace:

write config.tmp
fsync config.tmp
rename config.tmp -> config
fsync directory

The key is rename. Within one filesystem, rename usually provides atomic directory-entry replacement: after reboot, you should see either the old file or the new file, not a half-renamed state.

But that is not enough. The temporary file must be fsynced so its content is durable. After rename, the directory also needs to be synced so the directory entry replacement itself is durable.

Many “I used a temp file but still lost config” bugs come from skipping the file sync, the directory sync, or both.
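The full sequence of temp file, file sync, rename, and directory sync can be sketched as follows. This is a minimal Python version assuming a Linux-like POSIX system where a directory can be opened read-only and fsynced; atomic_write and fsync_dir are illustrative names, not an existing API.

```python
import os
import tempfile


def fsync_dir(dirpath: str) -> None:
    """fsync a directory so that entry changes (create, rename) are persisted."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data: temp file, fsync, rename, directory fsync."""
    dirpath = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays
    # within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dirpath, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make the new content durable first
        os.replace(tmp, path)      # atomic swap of the directory entry
        fsync_dir(dirpath)         # make the swap itself durable
    except BaseException:
        try:
            os.unlink(tmp)         # remove the temp file if it is still there
        except FileNotFoundError:
            pass
        raise
```

Note that mkstemp creates the temp file with restrictive permissions, so a real implementation may need to copy the mode and ownership of the original file before the rename.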

What fsync Actually Guarantees

fsync(fd) aims to synchronize file data and required metadata to storage so the file can recover to that state after power loss.

Engineering boundaries matter:

fsync on a file does not necessarily sync the parent directory entry
fdatasync syncs the data plus only the metadata needed to read it back (such as file size), and may skip the rest (such as timestamps)
the storage device may have its own write cache
whether a flush really reaches the media depends on the hardware honoring it
filesystem mount options affect ordering and journaling behavior

If a file is newly created or replaced by rename, parent directory persistence is also important. The content may be durable while the directory update is not.

So when you say “we called fsync,” ask:

which fd was synced
whether the directory was synced
whether the storage device actually completed flush
whether the business update spans multiple files

fsync is a persistence tool, not an automatic business transaction.

Flash Adds Erase and Wear Problems

Embedded devices often use flash. Flash is not ordinary memory. It usually cannot freely change a bit from 0 back to 1; it must erase by block and program by page.

This creates consequences:

small file updates may trigger larger erase/write operations
write latency may be unstable
erase cycles are limited, requiring wear leveling
power may fail during erase, move, or writeback
FTL or filesystem mapping tables may also need updates

Raw NAND/NOR, eMMC, SD card, UFS, SPI NOR, and SPI NAND have different power-loss behavior and reliability.

Some devices have internal controllers and FTLs that map logical blocks to physical flash. Some systems use flash-oriented filesystems such as JFFS2, UBIFS, or LittleFS. Each choice handles power-loss recovery, wear leveling, and write amplification differently.

So on embedded devices, “write a file” may involve erase, relocation, mapping-table updates, and bad-block management. Sudden power loss during these internal steps exposes whether the consistency design is sound.
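The erase-before-rewrite constraint can be made concrete with a toy model. The sketch below is an assumption-laden simplification of one NAND-style erase block (tiny pages, no ECC, no bad blocks); FlashBlock is a made-up class, not a driver API.

```python
ERASED = 0xFF


class FlashBlock:
    """Toy erase block: programming clears bits, only erase sets them again."""

    def __init__(self, pages: int = 4, page_size: int = 4):
        self.page_size = page_size
        self.pages = [bytearray([ERASED] * page_size) for _ in range(pages)]
        self.erase_count = 0

    def program(self, page: int, data: bytes) -> None:
        if len(data) != self.page_size:
            raise ValueError("data must be exactly one page")
        # Programming can only turn 1 bits into 0 bits: new = old AND data.
        for i, b in enumerate(data):
            self.pages[page][i] &= b

    def erase(self) -> None:
        # Erase works on the whole block and costs one wear cycle.
        for p in self.pages:
            p[:] = bytes([ERASED] * self.page_size)
        self.erase_count += 1


if __name__ == "__main__":
    blk = FlashBlock()
    blk.program(0, bytes([0x0F] * 4))
    blk.program(0, bytes([0xF0] * 4))  # "overwrite" without erase
    print(blk.pages[0].hex())          # bits only got cleared, not replaced
```

Programming a page twice without an erase yields the bitwise AND of both values rather than the new data, which is why updating a few bytes can force the system to relocate or erase a whole block, and why a power cut in the middle of that sequence matters.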

Why Databases and Logs Still Implement Transactions

If the filesystem has a journal, why do databases, configuration systems, and update systems still implement their own transactions?

Because filesystem journals usually do not know application semantics.

They know filesystem structures such as inodes, directory entries, and block allocation. They do not know:

these three files must update together
configuration must satisfy a checksum
database pages have version relationships
the boot partition must not switch before the update package is complete
log records must be replayable in business order

Reliable systems often add an application-level protocol:

dual configuration copies and version numbers
checksum or CRC
temporary file plus atomic rename
write-ahead log
transaction commit marker
A/B partition update
recovery or rollback at startup

The filesystem keeps lower-level structure consistent. The application protocol makes business state decidable and recoverable.
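As one illustration of such a protocol, the sketch below combines dual copies, version numbers, and a CRC. It is a minimal Python example under assumptions: the header layout, the two-slot scheme, and the names encode, decode, load_newest, and save are all hypothetical, and directory sync on first creation is omitted for brevity.

```python
import os
import struct
import tempfile
import zlib

HEADER = struct.Struct("<III")  # version, payload length, CRC32 of payload


def encode(version: int, payload: bytes) -> bytes:
    return HEADER.pack(version, len(payload), zlib.crc32(payload)) + payload


def decode(blob: bytes):
    """Return (version, payload), or None if the blob is truncated or corrupt."""
    if len(blob) < HEADER.size:
        return None
    version, length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return version, payload


def load_newest(paths):
    """Pick the valid copy with the highest version; None if no copy is valid."""
    best = None
    for p in paths:
        try:
            with open(p, "rb") as f:
                rec = decode(f.read())
        except FileNotFoundError:
            rec = None
        if rec is not None and (best is None or rec[0] > best[0]):
            best = rec
    return best


def save(paths, payload: bytes) -> int:
    """Write the next version into the slot that does not hold the newest copy."""
    current = load_newest(paths)
    version = 1 if current is None else current[0] + 1
    target = paths[version % len(paths)]  # with two slots, versions alternate
    with open(target, "wb") as f:
        f.write(encode(version, payload))
        f.flush()
        os.fsync(f.fileno())
    return version
```

If power fails while one slot is being rewritten, that slot fails its CRC or length check at startup and load_newest falls back to the other slot, so the device boots with the previous valid configuration instead of an empty or mixed one.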

What to Check After Power-Loss Corruption

When files disappear, configuration becomes empty, a database is damaged, or an update fails after power loss, check these layers.

First, does the application treat a successful write() return as proof of durability? Is there an fsync or an equivalent persistence step?

Second, does it overwrite the original file? Is there a truncate-then-write window?

Third, does it use a temporary file and atomic rename? Are both the temp file and directory synced?

Fourth, does business state span multiple files? Is there a commit marker, version number, or recovery logic?

Fifth, is the storage reliable? Does it have write cache, power-loss protection, hold-up time, and correct flush support?

Sixth, does the flash filesystem fit the workload? Do write frequency, erase-block size, wear leveling, and power-loss recovery match?

Seventh, does the test really simulate power loss? The timing of the power cut, the I/O load at that moment, and hold-up capacitors all affect whether the failure reproduces.

These questions are closer to the root cause than “the filesystem is unreliable.”

What to Remember in Practice

Filesystems fear sudden power loss because a write is not completed in one instant.

One file update may cross:

application buffering
kernel page cache
data block writes
metadata updates
filesystem journal
block device queue
storage device cache
flash erase and writeback

write() success does not mean business data is safely durable. A journaling filesystem can protect filesystem structure without automatically protecting application data semantics.

If data must survive power loss, treat file update as a protocol: write a temp file, sync content, atomically replace, sync the directory, add checksums, keep old versions, and make startup recovery explicit.