Linux Filesystems, Part 4 – Ext4 vs. Ext3 and why Delayed Allocation is Bad

This part covers the main differences between ext3 and ext4, with a focus on filesystem consistency. It was the initial motivation for this blog series, because many engineers are unaware that a default option of ext4 (delalloc) is dangerous for their data!

4.1 Main differences

Ext3 has been available since 2001 (Linux kernel 2.4.15); it extended ext2 with journalling to avoid filesystem corruption after a crash.

  • Maximum individual file size can be from 16 GiB to 2 TiB.
  • Overall ext3 file system size can be from 2 TiB to 32 TiB.
  • 32’000 subdirectories per directory.
  • You can convert an ext2 file system to ext3 file system directly (without backup/restore).

 

Ext4 was declared stable in 2008 with Linux kernel 2.6.28 (a development version, ext4dev, had been merged in 2.6.19 in 2006). It replaces ext3 and overcomes its limitations.

  • Supports huge individual file size and overall file system size.
  • Maximum individual file size can be from 16 GiB to 16 TiB.
  • Overall maximum ext4 file system size is 1 EiB (exbibyte) = 1’024 PiB (pebibyte) = 1’048’576 TiB (tebibyte).
  • 64’000 subdirectories per directory.
  • You can also mount an existing ext3 fs as ext4 fs (without having to upgrade it).
  • In ext4, you also have the option of disabling the journaling feature.

(cf. http://www.thegeekstuff.com/2011/05/ext2-ext3-ext4/)
Internet reviews generally conclude that the ext4 filesystem brings some performance improvements over ext3, but that it also ships with dangerous default options.

 

A good thing about ext4 is that ‘fsck’ is much faster; on a big ext3 filesystem with a huge number of files, ‘fsck’ can mean hours of downtime.
Other new features of ext4 include journal checksums, pre-allocation of blocks, and extents, which reserve contiguous ranges of blocks and so help to reduce fragmentation.
It also includes delayed allocation to improve write performance, but this has dangerous consequences (more on this below).

 

It is important to know that a power failure or a server crash can lead to data loss and corruption on an out-of-the-box ext4 filesystem.
With the correct options ext4 can be as safe as ext3, but then the performance advantage may be much smaller.

4.2 Ext3: rename-idiom provides atomic writes of data without writing it twice

To benefit from atomic writes of metadata and data without paying the cost of the option ‘data=journal’, many applications use this rename-idiom:

  • fd=open("file.new"); write(fd, data); close(fd); rename("file.new", "file");

With ext3 and ‘data=ordered’ we then have the guarantee that ‘file’ is either completely changed or unchanged (all or nothing): the data blocks are written before the metadata that references them is committed, so we roughly get atomicity at the data level without killing performance with ‘data=journal’. Ext3 also has ‘commit=5’ as default, which forces all data and metadata in the write cache to be sent to disk every 5 seconds. This can also be triggered manually with the command ‘sync’. Good applications like PostgreSQL do not rely on filesystem-dependent default settings and call ‘fsync()’ themselves.
In conclusion, after a crash ext3 with default options guarantees the atomicity and consistency of filesystem changes (all metadata) and, for the rename-idiom, of file changes (data), with a maximum loss of 5 seconds of metadata and data changes.
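
The rename-idiom described in this section can be sketched in Python as follows. This is only an illustration, not code taken from any particular application: the function name `replace_contents` and the `.new` suffix are assumptions chosen to mirror the one-liner above. It deliberately contains no ‘fsync()’ call, because that is how the applications discussed here relied on the ‘data=ordered’ behaviour of ext3.

```python
import os

def replace_contents(path, data):
    """Replace the contents of `path` using the plain rename-idiom (no fsync)."""
    tmp_path = path + ".new"  # hypothetical temporary name, as in the idiom above
    # Write the complete new contents to a separate file first.
    with open(tmp_path, "wb") as f:
        f.write(data)
    # os.replace() maps to rename() on Linux: the name atomically switches to
    # the new file, so readers see either the old or the new contents.
    os.replace(tmp_path, path)
```

Whether the new contents also survive a crash depends on the filesystem and its mount options, not on this code.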

4.3 Ext4: delayed allocation will destroy your data

With ext4, the default is still ‘commit=5,data=ordered’, but unfortunately the safety guarantees of ext3 no longer hold.
To improve performance, ext4 uses delayed allocation of blocks by default, so that blocks only get allocated when the data is actually written to disk. The rename-idiom therefore no longer works as many applications expect: the ‘rename()’ call changes only the metadata and may be committed before the data is written to disk, and as a consequence the resulting file after a crash is empty!
On ext3 this problem would only arise with ‘data=writeback’, but on ext4 it also happens with ‘data=ordered’, because for ext4’s author ‘data=ordered’ does not mean all data but only the data with allocated blocks. So it works when the application overwrites existing blocks of a file, but not when the file grows and needs new blocks.
So on ext4 ‘data=ordered’ and ‘data=writeback’ behave similarly when a file is enlarged, which is quite confusing and not clearly stated in the man pages. Moreover, with delayed allocation the data is actually written to disk only after 30-150 seconds (the exact window of data loss is not very clear), even though ‘commit=5’ is supposed (cf. ‘man mount’) to do it after 5 seconds.
In conclusion, after a crash ext4 with default options guarantees only the atomicity and consistency of filesystem changes (all metadata), with a maximum loss of 5 seconds of metadata changes. Data changes may suffer a loss of 30-150 seconds, and in the majority of cases all files changed in this window will be truncated to zero bytes! Atomic file changes through the rename-idiom no longer work.

This dramatic situation caused a lot of anger. Ext4’s author argued that the safety guarantees provided by ext3 were unnecessary from a POSIX point of view and that the solution was to fix all the “broken” applications (including GNU fileutils like ‘mv’), because they should explicitly call ‘fsync()’ before each ‘rename()’. (Note: ‘fsync()’ flushes a file’s data from the kernel write cache to disk, whereas ‘fflush()’ only flushes the application’s stdio buffers to the kernel write cache.)

But calling ‘fsync()’ would kill performance on ext3, and fixing 100’000 applications instead of fixing 1 filesystem is not practical.
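
For comparison, this is roughly what the ‘fsync()’-before-‘rename()’ variant looks like. It is a minimal sketch in Python, assuming a POSIX system; the function name `replace_contents_durably` and the `.new` suffix are hypothetical. The directory ‘fsync()’ at the end is not part of the discussion above: it is an extra step, commonly recommended to make the rename itself durable, and is labeled as an assumption in the code.

```python
import os

def replace_contents_durably(path, data):
    """Rename-idiom with an explicit fsync() before the rename()."""
    tmp_path = path + ".new"  # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # application buffers -> kernel write cache
        os.fsync(f.fileno())  # kernel write cache -> disk (the slow part)
    os.replace(tmp_path, path)  # rename() on Linux
    # Assumption, not from the article: also fsync the containing directory
    # so that the new directory entry is on disk as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Each call now waits for the disk at least once, which is where the performance cost mentioned above comes from.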

4.4 Linus Torvalds’s angry reaction

This led Linus Torvalds to react, as quoted in the following posts:

http://lwn.net/Articles/326342/

[Mr Ts'o shows considerable arrogance saying that virtually every application on the planet is "badly written" (including GNU fileutils, meaning most frequently used OS tools such as mv). He also seems unaware of what we might call "Hot topics in filesystem design", such as: "POSIX is not the bible of reliability it was never supposed to be" or "Users dislike empty files".

This dangerous combination of arrogance and ignorance is leading Mr Ts'o to quickly damage ext4 reputation and place it next to XFS in users minds, and we all know how hard it is to revert that kind of reputation. This may leave Linux users in many years to come between a rock and a hard place when it comes to filesystem performance: use the obsolete and slow ext3, or suffer the consequences of repeated slow fsync() calls in the much-needed ext4.

> Try ext4, I think you'll like it. :-) Failing that, data=writeback for single-user machines is probably your best bet.

Isn't that the same fix? ext4 just defaults to the crappy "writeback" behavior, which is insane. Sure, it makes things _much_ smoother, since now the actual data is no longer in the critical path for any journal writes, but anybody who thinks that's a solution is just incompetent. We might as well go back to ext2 then. If your data gets written out long after the metadata hit the disk, you are going to hit all kinds of bad issues if the machine ever goes down. Linus]

http://lwn.net/Articles/322823/

[Are we really saying that ext4 commits metadata changes to disk (potentially a long time) before committing the corresponding data change? That surely can't be right. Why on earth would you write metadata describing something which you know doesn't exist yet - and may never exist? Especially when the existing metadata describes something that does.]

See also the explanation by ext4’s author at http://ostatic.com/blog/recent-bug-report-details-data-loss-in-ext4-tso-explains-cause-and-workarounds.

The author of ext4 then developed new options to fix the problem (cf. http://lwn.net/Articles/476478/):

  • ‘nodelalloc’ disables the delayed allocation completely
  • ‘auto_da_alloc’ tries to detect the rename-idiom and forces the block allocation and data write prior to the ‘rename()’

4.5 The solution

In conclusion, ext4 should be mounted with ‘-o nodelalloc’ to make it safe against a server crash, and ext3 should use ‘-o barrier=1’ (barriers are disabled by default on ext3).
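
To check which ext4 filesystems on a machine are still mounted with delayed allocation, the mount table can be parsed. The sketch below is in Python and rests on two assumptions: that the input has the usual `/proc/mounts` layout (device, mount point, type, comma-separated options, ...) and that ‘nodelalloc’ appears among the options when it is active. The function name `ext4_without_nodelalloc` is hypothetical. Verify against the output of your own kernel before relying on it.

```python
def ext4_without_nodelalloc(mounts_text):
    """Return mount points of ext4 filesystems whose options lack 'nodelalloc'."""
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # Options are a comma-separated list, e.g. "rw,relatime,data=ordered".
        if fs_type == "ext4" and "nodelalloc" not in options.split(","):
            risky.append(mount_point)
    return risky
```

On a Linux host it could be called as `ext4_without_nodelalloc(open("/proc/mounts").read())`; every mount point it returns is a candidate for remounting with ‘-o nodelalloc’.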

4.6 A new serious problem of ext4

In theory, ext4 offers increased safety if you additionally use the options ‘-o journal_checksum,journal_async_commit’, but in October 2012 a serious bug was discovered that can lead to filesystem corruption with these options. Ext4’s author called it the “Lance Armstrong bug”: when the code never fails a test, but evidence shows it’s not behaving as it should (cf. http://forums.opensuse.org/english/other-forums/news-announcements/tech-news/479881-stable-linux-kernel-hit-ext4-data-corruption-bug.html – post2499841).
The author of ext4 then wrote that these two options were experimental and dangerous and should not be used. It is hard to understand why this warning was not given earlier, knowing that ext4 had been used in stable Linux distributions since 2009.
Despite these unfortunate stories, we should not conclude that ext4 is a bad filesystem. It definitely improves many things over ext3 and is better suited to large partitions with a huge number of files; the benefit is especially noticeable when running ‘fsck’.
But it is important to know exactly which dangers exist with ext4 and how to overcome them by using the right mounting options.

Comments

  1. James

    Is disabling delayed allocation tested well? Certain mount options have been buggy on EXT4.
    https://lkml.org/lkml/2013/8/2/377

    Remember that BTRFS and most modern filesystems, other than ext3, JFS and reiserfs, have delayed allocation.

    • Francois Scheurer

      Hi James!

      Thank you for your pointer.

      I knew that journal_checksum and journal_async_commit were buggy, but never heard about a bug with nodelalloc.
      In fact, without nodelalloc there is a real risk during a crash that changed files get zeroed.

      Regarding your bug report, it does not relate to the code of nodelalloc but to the validation of the mount options. see:
      Commit 26092bf (“ext4: use a table-driven handler for mount options”)
      wrongly disallows the specifying the mount options nodelalloc and
      data=journal simultaneously. This is incorrect; it should have only
      disallowed the combination of delalloc and data=journal
      simultaneously.
