This blog article details how an abrupt interruption of a write operation can corrupt a whole filesystem, and how a journal protects against such incomplete writes.
3.1 Consequences of incomplete writes
When you write a file, the underlying filesystem actually performs two different kinds of write operations: data writes and metadata writes. For example, when you append to a log file, the data is written to a filesystem block; when that block is full, a new block is allocated by updating the inode metadata.
The question is: “What happens if a power outage or a server crash occurs suddenly?”
If the metadata write completed but the data write did not, the last change to the affected file is only partially saved, and garbage data left over from an older block allocation may also appear in that file.
If the metadata write did not complete, things get uglier. For example, if the disk was in the middle of writing a change to a directory index, all files in that directory may become unreadable. A corrupted index may also confuse the filesystem and cause further errors after it is remounted, so the whole filesystem may become more and more corrupted over time.
For this reason, the ‘mount’ command (cf. Ext4 vs. Ext3 Filesystem and why Delayed Allocation is Bad) marks a filesystem as unclean (unless you mount it read-only) and ‘umount’ marks it clean again. Hence, after a crash the filesystem is still marked unclean and requires an ‘fsck’ run to verify and repair any inconsistencies.
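As a minimal sketch of how that flag could be inspected: `tune2fs -l` prints a `Filesystem state:` line in its superblock summary, and the helper below extracts its value. The function name `fs_state` is hypothetical, and the helper reads from standard input so it can be tried without a real device.

```shell
#!/bin/sh
# fs_state (hypothetical helper): print the value of the "Filesystem state:"
# line found in `tune2fs -l` output supplied on standard input.
fs_state() {
    sed -n 's/^Filesystem state:[[:space:]]*//p'
}

# Typical use (needs root and a real device, so it is left as a comment):
#   tune2fs -l /dev/sdXX | fs_state
```

A value of `clean` means the flag is set; any other value, such as `not clean`, suggests that an ‘fsck’ run is due.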
Some useful commands:
    # see ext3/4 fragmentation:
    mount -o remount,ro /dev/sdXX   # try to remount read-only
    e2fsck -nvf /dev/sdXX           # dry run without changing the filesystem (many irrelevant errors will be reported if the fs is rw instead of ro)
    # see ext3/4 flags:
    tune2fs -l /dev/sdXX
    # perform a filesystem check:
    e2fsck -fC0 /dev/sdXX           # never do it on a mounted fs, even mounted ro, else it may cause a kernel panic
3.2 The journal provides atomic writes
To prevent damage to metadata and corruption of the filesystem, a metadata write can be done in two steps: instead of going directly to the allocated blocks, it is first committed to the journal and only then written to those blocks. This strategy guarantees that the write is atomic (all or nothing): if a crash occurs before the write to the journal is complete, the write is simply dropped after the reboot; if a crash occurs after the journal commit, the write can be replayed.
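The two-step write can be sketched with ordinary files. This is only a toy model under stated assumptions, not how ext3/ext4 lay out their journal: the file names and the functions `journal_write`, `checkpoint` and `recover` are all hypothetical, and an empty marker file stands in for the commit record.

```shell
#!/bin/sh
# Toy write-ahead journal built from ordinary files (all names hypothetical).
jdir=$(mktemp -d)
trap 'rm -rf "$jdir"' EXIT
target="$jdir/metadata"     # final location of the "metadata"
journal="$jdir/journal"     # prefix for the journal files

# Step 1: record the new content in the journal, then mark it committed.
# Passing "crash" as $2 simulates a power loss before the commit marker.
journal_write() {
    printf '%s' "$1" > "$journal.data"
    if [ "${2:-}" = "crash" ]; then
        return 0
    fi
    : > "$journal.commit"
}

# Step 2: copy the journaled content to its final location, clear the journal.
checkpoint() {
    cp "$journal.data" "$target"
    rm -f "$journal.data" "$journal.commit"
}

# After a reboot: replay a committed entry, drop an uncommitted one.
recover() {
    if [ -e "$journal.commit" ]; then
        checkpoint
    else
        rm -f "$journal.data"
    fi
}

printf 'old' > "$target"
journal_write 'new' crash; recover
echo "crash before commit: $(cat "$target")"
journal_write 'new'; recover
echo "crash after commit:  $(cat "$target")"
```

In this model the first run leaves `old` in place and the second ends with `new`: the target never holds a half-written value, because recovery either drops the entry or replays it in full.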
Ext3 and ext4 use journaled writes with 3 different modes, selected by a mount option (cf. ‘man mount’):
- ‘data=writeback’: only metadata is written to the journal. This guarantees the consistency of the filesystem at a minimal performance cost (metadata has to be written twice). Data and metadata are written concurrently, hence after a crash the file may contain garbage data from an older block allocation; e.g. “abcdef” -> “ABCDEFGH” may end up as “ABCdefXY” (XY being some garbage). Data from an older file belonging to another user appearing after the crash is also a security problem.
- ‘data=ordered’: the default and recommended setting. It is similar to ‘data=writeback’, but instead of writing data and metadata asynchronously, it writes the metadata to the journal only after the data has been written to disk. Compared to ‘data=writeback’ it is slightly slower, but it does not have the security hole after a crash and it can make data changes atomic as well (cf. Ext4 vs. Ext3 Filesystem). Hence after a crash the file will never contain garbage data from an older block allocation, but it may be incomplete; e.g. “abcdef” -> “ABCDEFGH” may end up as “ABCdef”.
- ‘data=journal’: the safest and slowest mode. Here both metadata and data are written to the journal, which guarantees the consistency of the filesystem and of the file, at the cost of a write performance drop of about 50% (the data is written twice); this mode works like an ACID database (the redo logs of Oracle or the write-ahead logs of PostgreSQL).
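The “abcdef” -> “ABCDEFGH” example can be reproduced with a toy model. The sketch below assumes a single "disk block" file that still holds two stale bytes (`XY`) from an older file, plus a separate file standing in for the inode size; every name in it is hypothetical and nothing touches a real filesystem.

```shell
#!/bin/sh
# Toy model of a crash in the middle of overwriting "abcdef" with "ABCDEFGH".
ddir=$(mktemp -d)
trap 'rm -rf "$ddir"' EXIT
block="$ddir/block"     # raw block contents, including stale bytes
size="$ddir/size"       # stand-in for the inode metadata: logical file length

reset_disk() {
    printf 'abcdefXY' > "$block"   # "XY" is stale data left by an older file
    printf '6' > "$size"           # the file is logically "abcdef"
}

# Overwrite the first $2 bytes of the block, in place, with the start of $1.
partial_data_write() {
    printf '%s' "$1" | dd of="$block" bs=1 count="$2" conv=notrunc 2>/dev/null
}

# Read the file as the filesystem would: only as many bytes as the metadata says.
read_file() {
    head -c "$(cat "$size")" "$block"
}

# data=writeback: the metadata may reach the disk before the data does.
crash_writeback() {
    reset_disk
    printf '8' > "$size"               # new length already committed
    partial_data_write 'ABCDEFGH' 3    # crash after 3 bytes of data
    read_file
}

# data=ordered: the metadata is committed only once the data is fully written.
crash_ordered() {
    reset_disk
    partial_data_write 'ABCDEFGH' 3    # crash before the metadata commit
    read_file
}

echo "writeback: $(crash_writeback)"
echo "ordered:   $(crash_ordered)"
```

The writeback case prints `ABCdefXY` and the ordered case prints `ABCdef`, matching the two outcomes described in the list above.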
Some useful commands:
    # http://fenidik.blogspot.ch/2010/03/ext4-disable-journal.html
    # Create ext4 fs on /dev/sdXX disk
    mkfs.ext4 /dev/sdXX
    # Enable writeback mode. This mode will typically provide the best ext4 performance.
    tune2fs -o journal_data_writeback /dev/sdXX
    # Delete has_journal option
    tune2fs -O ^has_journal /dev/sdXX
    # or use: mkfs.ext4 -O ^has_journal /dev/sdXX
    # Required fsck
    e2fsck -f /dev/sdXX
    # Check fs options
    dumpe2fs /dev/sdXX | grep features
    tune2fs -l /dev/sdXX
    # For more performance add fstab options:
    /dev/sdXX /opt ext4 defaults,data=writeback,noatime,nodiratime 0 0