aarch64: ext4 metadata integrity regression in kernels >= 5.5 ?

Tue Oct 27 15:51:46 EDT 2020

On Sun, Jul 12, 2020 at 11:07:39AM +0100, Russell King - ARM Linux admin wrote:
> On Sun, Jul 12, 2020 at 10:22:31AM +0100, Russell King - ARM Linux admin wrote:
> > Some will know that during the last six months, I've been seeing
> > problems on the LX2160A rev 1 with corrupted checksums on an EXT4
> > FS on an NVMe.  I'm not certain exactly which kernels are
> > affected, but I know that 5.1 seems to be fine, and 5.5, possibly
> > 5.4 onwards seem affected, maybe earlier.
> > 
> > The symptom is that the kernel will run for some random amount of
> > time (between a few days and a few months) and then EXT4 will
> > complain with "iget: checksum invalid" on the root filesystem either
> > during a logrotate or a mandb rebuild.
> > 
> > Upon investigation with debugfs and hexdump, it appeared that a single
> > EXT4 inode in one sector contained an invalid 32-bit checksum.  EXT4
> > splits the 32-bit checksum into two 16-bit halves and stores them in
> > separate locations in the inode, consequently any read or update of
> > the checksum requires two separate reads or writes.
> > 
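For anyone wanting to pick the two halves out of a hexdump of the inode,
here is a rough sketch in Python.  The offsets are as I understand the
ext4 on-disk layout (l_i_checksum_lo at 0x7C, i_checksum_hi at 0x82, both
little-endian); the function names are made up for illustration, and this
is not the kernel's code.  It also assumes the inode is large enough for
the extra fields to cover i_checksum_hi.

```python
import struct

# Byte offsets inside one on-disk ext4 inode (little-endian fields).
# Assumption: taken from the ext4 on-disk layout documentation.
CHECKSUM_LO_OFFSET = 0x7C  # osd2.linux2.l_i_checksum_lo
CHECKSUM_HI_OFFSET = 0x82  # i_checksum_hi, in the extra inode fields


def read_inode_checksum(raw: bytes) -> int:
    """Join the two 16-bit halves into the 32-bit checksum (two reads)."""
    (lo,) = struct.unpack_from("<H", raw, CHECKSUM_LO_OFFSET)
    (hi,) = struct.unpack_from("<H", raw, CHECKSUM_HI_OFFSET)
    return (hi << 16) | lo


def write_inode_checksum(raw: bytearray, csum: int) -> None:
    """Split a 32-bit checksum and store each half separately (two writes)."""
    struct.pack_into("<H", raw, CHECKSUM_LO_OFFSET, csum & 0xFFFF)
    struct.pack_into("<H", raw, CHECKSUM_HI_OFFSET, (csum >> 16) & 0xFFFF)
```

Usage would be along the lines of reading the raw inode bytes into a
buffer (a 256-byte inode size is assumed here) and calling
read_inode_checksum() on it; the result should match the "Inode checksum"
line that debugfs prints.
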
> > The problem initially seemed to correlate with powering the platform
> > down as the trigger, and it was suggested that the NVMe was at fault.
> > However, a recent case disproved that theory: the problem appeared
> > to correct itself after I ran "hdparm -f" on the drive - e2fsck
> > then found no errors on the filesystem, and I
> > could remount the filesystem in read/write mode.  "hdparm -f" syncs
> > the device and flushes the kernel cache, which it also does when you
> > use "hdparm -t" to measure disk performance.
> > 
> > My next question was whether it was being caused by PCIe ordering
> > issues.  I've since upgraded the machine to a LX2160A rev 2, which has
> > yet to show any symptoms of this.
> > 
> > However, the reason for this email is a troubling development with this
> > problem:
> > 
> > [7478798.720368] EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #157096: comm mandb: iget: checksum invalid
> > [7478798.729925] Aborting journal on device mmcblk0p1-8.
> > [7478798.734070] EXT4-fs (mmcblk0p1): Remounting filesystem read-only
> > [7478798.734589] EXT4-fs error (device mmcblk0p1): ext4_journal_check_start:84: Detected aborted journal
> > 
> > Running "e2fsck -n" on the system without having done anything gives:
> > 
> > Inode 13755 passes checks, but checksum does not match inode.  Fix? no
> > Inode 157096 passes checks, but checksum does not match inode.  Fix? no
> > 
> > amongst other errors, which are expected for a filesystem that is
> > normally "in-use".  Using "hdparm -f" does not make these errors go
> > away.
> > 
> > The offending inodes found by e2fsck correspond to:
> >   /usr/share/man/nl/man1/apt-transport-mirror.1.gz
> >   /lib/firmware/rtl_bt/rtl8723a_fw.bin
> > 
> > However, just like all the other instances, these would not have changed
> > recently except for atime updates.
> > 
> > There are a couple of important differences here:
> > - It is an Armada 8040 system - Clearfog GT-8K running a 5.6 kernel,
> >   rather than the LX2160A.
> > - Its rootfs is on eMMC, not NVMe.
> > 
> > That seems to rule out the NVMe being a cause of the problem, and any
> > PCIe issues of the LX2160A rev 1.
> > 
> > Another data point is that I'm also running an Armada 8040 system as a
> > VM host, which has over a year uptime, so is on an older kernel (5.1).
> > This uses EXT4 for its rootfs as well, but is on SATA SSD, and has not
> > shown any issues.  The VMs it runs are a later kernel (5.6) also with
> > EXT4, and have yet to display any symptoms.
> > 
> > The similarities are - the kernel is the same or similar binary on the
> > failing systems (I've been running the same kernel config on both.)
> > Both are a Cortex-A72, but slightly different revisions.
> > 
> > So, it's starting to feel like an aarch64 problem, potentially a
> > locking or ordering issue.  Due to how rare this issue is,
> > investigating it is likely very difficult.  However, it seems to be
> > very real, as the symptoms have now been observed on two rather
> > different aarch64 platforms.
> > 
> > Due to the amount of time required to test, it is very difficult to do any
> > kind of bisection, or test alternative kernels - it would take months
> > of runtime for a single test.
> > 
> > I'm chucking this out there so that if anyone else is seeing this
> > behaviour, they can shout and maybe confirm what I'm seeing.
> 
> A bit more information:
> 
> Inode 157096 is /usr/share/man/nl/man1/apt-transport-mirror.1.gz:
> 
> --- bad
> +++ fixed
>  debugfs:  stat <157096>
>  Inode: 157096   Type: regular    Mode:  0644   Flags: 0x80000
>  Generation: 3717235945    Version: 0x00000000:00000001
>  User:     0   Group:     0   Project:     0   Size: 3811
>  File ACL: 0
>  Links: 1   Blockcount: 8
>  Fragment:  Address: 0    Number: 0    Size: 0
>   ctime: 0x5ebcd62f:ba34bf1c -- Thu May 14 06:25:03 2020
>   atime: 0x5ebcd63b:a2906fa0 -- Thu May 14 06:25:15 2020
>   mtime: 0x5eba730a:00000000 -- Tue May 12 10:57:30 2020
>  crtime: 0x5ebcd62f:a25cccf4 -- Thu May 14 06:25:03 2020
>  Size of extra inode fields: 32
> -Inode checksum: 0x13fd5c3c
> +Inode checksum: 0x600eba80
>  EXTENTS:
>  (0):1173965
> 
> Note that mandb is set to run daily, so one must assume that the
> inode checksum was fine the previous day.  Note that the file itself
> is fine - it passes gzip's integrity checks, and the contents are
> correct:
> 
> # zcat /usr/share/man/nl/man1/apt-transport-mirror.1.gz >/dev/null
> 
> For the other inode, 13755, /lib/firmware/rtl_bt/rtl8723a_fw.bin:
> 
> --- bad
> +++ fixed
>  debugfs:  stat <13755>
>  Inode: 13755   Type: regular    Mode:  0644   Flags: 0x80000
>  Generation: 2326028864    Version: 0x00000000:00000001
>  User:     0   Group:     0   Project:     0   Size: 24548
>  File ACL: 0
>  Links: 1   Blockcount: 48
>  Fragment:  Address: 0    Number: 0    Size: 0
>   ctime: 0x5e88ffc5:b9a541e4 -- Sat Apr  4 22:44:37 2020
>   atime: 0x5e88ffc4:00000000 -- Sat Apr  4 22:44:36 2020
>   mtime: 0x5d5f3bb0:00000000 -- Fri Aug 23 02:04:48 2019
>  crtime: 0x5e88ffc5:51b03564 -- Sat Apr  4 22:44:37 2020
>  Size of extra inode fields: 32
> -Inode checksum: 0x4d9c9f81
> +Inode checksum: 0x487c2bf3
>  EXTENTS:
>  (0-5):835670-835675
> 
> In both cases, the times suggest that there has been no change made to
> these inodes recently.
> 
> It would have been great to know the state of these inodes prior to the
> checksum not matching, but alas, time travel has yet to be invented!
> Maybe if/when it happens again on the Armada 8040, I'll have an ext4fs
> image to compare against - and hopefully identify exactly what has
> changed.

The problems persisted until I added some additional debug to the
ext4 code; since then, the Armada 8040 system has been up for 58
days without incident. This suggests that it is a subtle timing bug,
which is going to be nigh on impossible to debug. Unfortunately, it
means that I just can't trust recent aarch64 kernels not to corrupt
my filesystems, and I certainly can't trust them to run any of my
critical systems.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!