[openwrt/openwrt] ipq40xx: wpj428: panic on squashfs error to work around boot limbo

Sun Sep 24 09:56:20 PDT 2023

hauke pushed a commit to openwrt/openwrt.git, branch master:
https://git.openwrt.org/98d325aaf8bef992cc92e94feb14fe271d370dc0

commit 98d325aaf8bef992cc92e94feb14fe271d370dc0
Author: Leon M. Busch-George <leon at georgemail.eu>
AuthorDate: Sat Jul 22 10:29:56 2023 +0200

    ipq40xx: wpj428: panic on squashfs error to work around boot limbo
    
    Apparently, a few ipq40xx devices have sporadic problems when reading the
    flash over SPI. When that happens, the result of the faulty SPI read is
    cached and it isn't re-attempted. Depending on when it happens, the router
    either panics and reboots or is left in a partially broken state (an
    application won't start).
    The data on the flash is alright.
    
    This wasn't the case with OpenWrt with Linux < 5.x but I wasn't able to
    work out which software change was responsible.
    
    GitHub user karlpip created a patch for testing that disabled the cache
    entirely and added logs. Typically, only one or two SPI operations fail at
    a time:
    
      [689200.631152] spi-nor spi0.0: SPI transfer failed: -110
      [689200.631280] spi_master spi0: failed to transfer one message from queue
      [689200.635369] jffs2: Write of 68 bytes at 0x00ffccf4 failed. returned -110, retlen 0
      [689200.642014] jffs2: Not marking the space at 0x00ffccf4 as dirty because the flash driver returned retlen zero
    
    Because reads aren't re-attempted, squashfs can't recover:
    
      [3171844.279235] SQUASHFS error: Failed to read block 0x2bb912: -5
      [3171844.279284] SQUASHFS error: Unable to read fragment cache entry [2bb912]
      [3171844.283980] SQUASHFS error: Unable to read page, block 2bb912, size 14e6c
      [3171844.291650] SQUASHFS error: Unable to read fragment cache entry [2bb912]
      [3171844.297831] SQUASHFS error: Unable to read page, block 2bb912, size 14e6c
    
    I assume there to be some kind of underlying electrical problem because,
    in my experience, this happens a lot more when PoE is used.
    
    NoTengoBattery has made an in-depth investigation:
    https://forum.openwrt.org/t/patch-squashfs-data-probably-corrupt/70480
    
    .. and created a patch that evicts the page cache and retries reading:
    https://github.com/NoTengoBattery/openwrt/blob/linksys-ea6350v3-mastertrack/target/linux/ipq40xx/patches-5.4/9996-fs_squashfs_improve_squashfs_error_resistance.patch
    
    The patch also works well with the WPJ428 but NoTengoBattery didn't try to
    upstream it ("This is not the solution that should be used").
    
    In 2020, I tried and failed to create a working patch that prevents faulty
    pages from being cached in the first place. Because I needed a solution, I
    backported
      "squashfs: add option to panic on errors" (10dde05b89980ef)
    which has since become available in OpenWrt.
    
    The 'errors=panic' option has been tested on a fleet of several hundred
    WPJ428s over multiple years. Without this patch, devices regularly went
    into 'limbo' on reboot or update and required a manual reboot.
    Devices with this patch don't. I was initially concerned that the kernel
    panic would leave devices with actually corrupted data but I haven't seen a
    case of actual corruption since (outside of people turning off the power
    during upgrades).
    
    The WPJ428 is the only device I tested this patch on - others might also
    benefit.
    
    Reviewed-by: Robert Marko <robimarko at gmail.com>
    Signed-off-by: Leon M. Busch-George <leon at georgemail.eu>
---
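Not part of the commit: a minimal sketch for checking on a running device
that the option from the DTS change actually reached the kernel. It assumes
that the string given in bootargs-append ends up in /proc/cmdline; the
function name has_errors_panic is hypothetical and exists only for this
illustration.

```shell
# Hypothetical helper (not part of OpenWrt): succeeds if the given kernel
# command line carries errors=panic among its rootflags.
has_errors_panic() {
	# $1: kernel command line, e.g. the contents of /proc/cmdline
	for arg in $1; do
		case "$arg" in
			rootflags=*)
				# rootflags is a comma-separated list of mount options;
				# wrap it in commas so every option can be matched whole
				case ",${arg#rootflags=}," in
					*,errors=panic,*) return 0 ;;
				esac
				;;
		esac
	done
	return 1
}
```

On a device this could be called as
has_errors_panic "$(cat /proc/cmdline)" && echo "errors=panic is set".
The check only looks at the command line; whether squashfs accepted the
option is better confirmed from the mount options shown for the root
filesystem.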
 .../ipq40xx/files/arch/arm/boot/dts/qcom-ipq4028-wpj428.dts      | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/target/linux/ipq40xx/files/arch/arm/boot/dts/qcom-ipq4028-wpj428.dts b/target/linux/ipq40xx/files/arch/arm/boot/dts/qcom-ipq4028-wpj428.dts
index 48b5cd53d8..d84d54e39b 100644
--- a/target/linux/ipq40xx/files/arch/arm/boot/dts/qcom-ipq4028-wpj428.dts
+++ b/target/linux/ipq40xx/files/arch/arm/boot/dts/qcom-ipq4028-wpj428.dts
@@ -25,6 +25,15 @@
 	model = "Compex WPJ428";
 	compatible = "compex,wpj428";
 
+	chosen {
+		/*
+		 * There's a chance that SPI reads fail even though the data itself is alright.
+		 * The read result is cached and squashfs can't recover.
+		 * Just panic when that happens and hope that next time it doesn't.
+		 */
+		bootargs-append = " rootflags=errors=panic";
+	};
+
 	soc {
 		rng@22000 {
 			status = "okay";