[RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1

Lukasz Raczylo lukasz at raczylo.com
Sat Apr 25 14:48:25 PDT 2026


A follow-up runtime data point on this series.

Fleet state at 2026-04-25 21:46 UTC:

  * Patched uptime (since staggered rollout 2026-04-24 18:10-19:20 UTC):
    - shortest: 26h 26m   (last master upgraded)
    - longest:  27h 34m   (canary)
    - cumulative across 24 nodes: ~651 node-hours

  * macb-attributable event counts (out-of-band userspace watchdog;
    its [tx-stall] detector samples
    /sys/class/net/end0/statistics/tx_packets and the qdisc backlog
    every 1 s, and would have fired ip link down/up if any node's TX
    path froze):
    - RECOVER trigger=tx-stall (actual stalls caught):    0
    - partial [tx-stall] markers (transient 1 s freezes): 0

  * Separately, 40 RECOVER events with trigger=ping fired across the
    fleet in this window, attributable to a brief upstream network
    outage (gateway/switch event): every node lost ping to the gateway,
    VIP, and NAS within seconds of the others, then recovered.  These
    are unrelated to the macb hang this series targets; telling them
    apart from a real TX stall is exactly what the trigger= tag in the
    watchdog log is for.
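
For readers who want to reproduce the check, here is a minimal sketch of the kind of 1 s sampling logic described above. It assumes a sample counts as "frozen" when tx_packets did not advance while the qdisc backlog was non-zero. The threshold STALL_SAMPLES, the Sample/scan/tally names, and the "[tx-stall] partial" log string are all hypothetical and not taken from the actual watchdog. Samples are passed in as a list so the logic runs without hardware; a live version would poll /sys/class/net/end0/statistics/tx_packets and read the backlog from something like `tc -s qdisc show dev end0`.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical: consecutive frozen 1 s samples required before a RECOVER.
STALL_SAMPLES = 3


@dataclass
class Sample:
    tx_packets: int    # value read from .../statistics/tx_packets
    backlog_pkts: int  # qdisc backlog (packets) at the same instant


def scan(samples):
    """Return watchdog log lines for a series of 1 s samples.

    The first frozen sample of an episode logs a partial [tx-stall]
    marker; STALL_SAMPLES consecutive frozen samples produce one
    RECOVER per episode.  Any progress (or an empty backlog) resets it.
    """
    log = []
    frozen = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur.tx_packets == prev.tx_packets and cur.backlog_pkts > 0:
            frozen += 1
            if frozen == 1:
                log.append("[tx-stall] partial")
            if frozen == STALL_SAMPLES:
                # The real watchdog would bounce the link here (ip link down/up).
                log.append("RECOVER trigger=tx-stall")
        else:
            frozen = 0
    return log


def tally(log):
    """Count RECOVER events by their trigger= tag (e.g. tx-stall vs ping)."""
    return Counter(
        line.split("trigger=", 1)[1]
        for line in log
        if line.startswith("RECOVER trigger=")
    )
```

An idle link (no progress but an empty backlog) is deliberately not flagged, which is what separates "nothing to send" from "queued but not sending"; tallying by trigger= is what keeps upstream ping loss out of the tx-stall count.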

At the pre-patch rate referenced in the cover letter (50 stalls in
95 node-hours during our 2026-04-24 14:00-18:10 UTC reference window,
~0.53 per node-hour), the projected stall count over 651 node-hours
is roughly 343; observed is 0.
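
The projection is a straight rate scaling, reproduced below. As an added illustration only, it also computes the chance of seeing zero stalls at the old rate, under the assumption (not measured here) that stalls arrive as a Poisson process with that rate.

```python
import math

PRE_STALLS = 50           # stalls in the pre-patch reference window
PRE_NODE_HOURS = 95       # node-hours covered by that window
PATCHED_NODE_HOURS = 651  # cumulative patched node-hours so far

rate = PRE_STALLS / PRE_NODE_HOURS    # stalls per node-hour
expected = rate * PATCHED_NODE_HOURS  # projected stalls at the old rate

# Poisson assumption: P(N = 0) = exp(-expected).
p_zero = math.exp(-expected)

print(f"rate={rate:.3f}/node-hour expected={expected:.1f} P(0)={p_zero:.1e}")
```

With these inputs expected comes out at about 342.6 and P(0) is far below 1e-100, so under that model a zero count is not plausibly chance.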

Same monitoring stays in place; will reply again after a full week
of uptime unless something changes.

--
Lukasz Raczylo



More information about the linux-arm-kernel mailing list