[PATCH net-next v2 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
Lukasz Raczylo
lukasz at raczylo.com
Thu May 14 14:54:56 PDT 2026
Hi netdev, Théo, Andrea, linux-rpi,
v2 of the silent TX stall series. The v1 RFC sits at:
https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/
Reframing first. The v1 cover claimed "zero events post-patch";
that was true only at the level visible to the user-space watchdog.
A dmesg sweep prompted by Andrea's review -- with patch 3's warn
made unconditional, per his ask -- revealed kernel-level evidence
that patches 1 and 2 are partial at best. Patch 3 is empirically
the load-bearing fix on this platform: it caught and recovered a
real lost-TCOMP stall on pi-data-02 at 2026-05-05T13:24:09Z
(queue 0, tail=259564431 head=259564433 after ~260M TX, HW
ETHS tx_frames counter advancing through the event while driver
tx_tail did not) without user-space involvement.
So the v2 narrative reads:
* Patch 1 (PCIe posted-write flush) and patch 2 (PCIe read
barrier before descriptor check) close two specific
candidate races in the TSTART / TX_USED paths. Plausible
and well-motivated, but I cannot prove either fires in
isolation on this hardware -- my 1 Hz trace shows TX
freezes, not which mechanism caused them.
* Patch 3 (TX stall watchdog) is the safety net that
empirically does the recovery work. It has 13 days of
production runtime on 24 nodes since 2026-05-02 in the same
form (anchored against the rpi-6.18.y vendor fork, in
raspberrypi/linux#7340, merged 2026-05-08 after review
feedback from pelwell that this v2 incorporates).
The v1 cover's "zero stalls in 95 node-hours of post-patch
uptime" framing was misleading. Apologies for that.
## What changed in v2
Patch 1 (PCIe posted-write flush after TSTART doorbell):
* Gates the readback behind a new MACB_CAPS_PCIE_POSTED_WRITES
capability, set only on raspberrypi_rp1_config. v1
applied the readback to every macb variant; SoC-integrated
parts (Atmel, Microchip, SiFive, Xilinx) have no posted-write
fabric and were paying the readback latency for no benefit.
* Commit message notes that the readback also flushes the
preceding macb_tx_lpi_wake() NCR write on the same path --
not just TSTART -- since it functions as a PCIe read barrier
for all prior posted writes by the same requester.
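To illustrate the gating described above, here is a minimal user-space
sketch, not the patch itself. Every name in it (model_dev,
model_tx_start, MODEL_CAPS_PCIE_POSTED_WRITES, the bit value chosen for
the doorbell) is hypothetical and only stands in for the driver's real
identifiers. The model assumes a posted write sits in the fabric until
the same requester issues a read, which is the ordering property the
readback relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the driver. */
#define MODEL_CAPS_PCIE_POSTED_WRITES (1u << 0) /* stands in for MACB_CAPS_PCIE_POSTED_WRITES */
#define MODEL_NCR_TSTART (1u << 9)              /* model value for the doorbell bit */
#define MODEL_MAX_PENDING 8

struct model_dev {
	bool bus_posts_writes;   /* hardware property: writes may sit in the fabric */
	uint32_t caps;           /* what the driver believes about the platform */
	uint32_t ncr;            /* NCR as the device currently sees it */
	uint32_t pending[MODEL_MAX_PENDING]; /* posted NCR writes not yet delivered */
	int npending;
	unsigned int readbacks;  /* number of flushing reads issued */
};

/* A register write: posted (queued) on a posting bus, immediate otherwise. */
static void model_writel_ncr(struct model_dev *d, uint32_t val)
{
	if (d->bus_posts_writes) {
		assert(d->npending < MODEL_MAX_PENDING);
		d->pending[d->npending++] = val;
	} else {
		d->ncr = val;
	}
}

/* A register read: all earlier posted writes are delivered before it returns. */
static uint32_t model_readl_ncr(struct model_dev *d)
{
	for (int i = 0; i < d->npending; i++)
		d->ncr = d->pending[i];
	d->npending = 0;
	d->readbacks++;
	return d->ncr;
}

/* Ring the doorbell; pay for the readback only where the capability is set. */
static void model_tx_start(struct model_dev *d)
{
	model_writel_ncr(d, MODEL_NCR_TSTART);
	if (d->caps & MODEL_CAPS_PCIE_POSTED_WRITES)
		(void)model_readl_ncr(d);
}

static bool model_doorbell_visible(const struct model_dev *d)
{
	return (d->ncr & MODEL_NCR_TSTART) != 0;
}
```

In this model a SoC-integrated device (no posting, capability clear)
sees the doorbell immediately and issues no readback, while a posting
bus without the capability leaves the doorbell pending.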
Patch 2 (PCIe read barrier before TX completion descriptor check):
* Dropped the ISR read. v1 read ISR in macb_tx_poll() with
`queue_readl(queue, ISR) & MACB_BIT(TCOMP)`; that's
destructive on RP1 silicon (MACB_CAPS_ISR_CLEAR_ON_WRITE
is not set on raspberrypi_rp1_config; the existing handler
assumes read-clear semantics and processes every bit
returned from queue_readl(queue, ISR) in one pass). v1's
masked-and-discarded read silently consumed any other bit
set in ISR at that instant -- RCOMP being the worst case
(RX completion never scheduled until the line re-asserts).
* v2 substitutes `(void)queue_readl(queue, IMR)` -- IMR is
the read-only interrupt mask mirror, no side effects, still
flushes prior peripheral DMA writes via PCIe completion
ordering. Loses the "directly sample latched TCOMP" half
of v1's claim; keeps the PCIe-barrier half, which is the
half that addresses the documented race in the existing
macb_tx_complete_pending() rmb() comment.
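The difference between the v1 and v2 reads can be sketched with a small
user-space model. This is an illustration under stated assumptions, not
driver code: model_queue and the helper names are hypothetical, the bit
positions are model values, and the ISR is assumed to be read-to-clear
as described above for RP1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model values; the driver's own bit macros are not used here. */
#define MODEL_RCOMP (1u << 1)
#define MODEL_TCOMP (1u << 7)

struct model_queue {
	uint32_t isr; /* latched status; reading clears it in this model */
	uint32_t imr; /* mask mirror; reading has no side effects */
};

/* Hardware latches one or more status bits. */
static void model_raise(struct model_queue *q, uint32_t bits)
{
	assert(bits != 0);
	q->isr |= bits;
}

/* Read-to-clear: the caller gets every latched bit exactly once. */
static uint32_t model_read_isr(struct model_queue *q)
{
	uint32_t v = q->isr;

	q->isr = 0;
	return v;
}

static uint32_t model_read_imr(const struct model_queue *q)
{
	return q->imr;
}

/* v1-style probe: keeps TCOMP, silently discards everything else it read. */
static int model_v1_probe(struct model_queue *q)
{
	return (model_read_isr(q) & MODEL_TCOMP) != 0;
}

/* v2-style barrier: a read used only for its ordering effect. */
static void model_v2_barrier(const struct model_queue *q)
{
	(void)model_read_imr(q);
}

/* The interrupt handler processes whatever one ISR read returns. */
static uint32_t model_irq_handler(struct model_queue *q)
{
	return model_read_isr(q);
}
```

With RCOMP and TCOMP both latched, the v1-style probe leaves nothing for
the handler, so the RX completion is lost in this model; the v2-style
barrier leaves both bits in place.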
Patch 3 (TX stall watchdog):
* Tail movement is tracked via a `bool tx_stall_tail_moved`
set by macb_tx_complete() under tx_ptr_lock when tail
advances, and cleared by the watchdog tick on the same lock.
v1 snapshotted tx_tail and compared between ticks; while
that worked correctly given tx_tail is free-running u32,
the bool form is unambiguously cleaner, doesn't depend on
counter behaviour, and is what pelwell asked for when he
reviewed the same series on the rpi side
(raspberrypi/linux#7340).
* netif_carrier_ok() gate added at the top of the watchdog
tick. Eliminates the boot-time false positive seen in v1
where, between macb_open() and link-autoneg-completion,
queue->tx_head can advance from kernel-queued packets while
tx_tail stays at 0 (no TCOMPs yet), tripping the snapshot
check. Observed 6 such fires during a 2026-05-02 fleet
rolling reboot.
* netdev_warn_once -> netdev_warn_ratelimited. v1's
netdev_warn_once made operational accounting impossible
after the first fire on a given netdev; ratelimited keeps
bounded log noise but lets operators count events. Andrea
asked for this directly.
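The watchdog logic described in these three points can be sketched as a
user-space model. It is a simplification under assumptions, not the
patch: model_txq and its helpers are hypothetical names, carrier_ok
stands in for netif_carrier_ok(), tail_moved stands in for
tx_stall_tail_moved, locking under tx_ptr_lock is elided, and the
choice to leave the flag untouched while the carrier is down is an
assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one TX queue; not the driver's struct macb_queue. */
struct model_txq {
	uint32_t head;           /* free-running producer index */
	uint32_t tail;           /* free-running consumer index */
	bool tail_moved;         /* set on completion, cleared by the tick */
	bool carrier_ok;         /* stands in for netif_carrier_ok() */
	unsigned int recoveries; /* number of stalls the tick has declared */
};

/* Completion path: advance tail and record that progress happened. */
static void model_tx_complete(struct model_txq *q, uint32_t n)
{
	/* Unsigned subtraction stays correct across u32 wraparound. */
	assert(n <= q->head - q->tail);
	if (n) {
		q->tail += n;
		q->tail_moved = true;
	}
}

/* Periodic tick: returns true when a stall is declared on this tick. */
static bool model_watchdog_tick(struct model_txq *q)
{
	bool stalled;

	/* Before link-up there are no completions to expect: do nothing. */
	if (!q->carrier_ok)
		return false;

	/* Work is outstanding and nothing completed since the last tick. */
	stalled = (q->head != q->tail) && !q->tail_moved;
	q->tail_moved = false;
	if (stalled)
		q->recoveries++;
	return stalled;
}
```

Because the tick only consumes a flag, it never compares two tail
snapshots, so the result does not depend on how the counter wraps. With
the carrier gate in place, a queue whose head has advanced before
link-up does not trip the check in this model.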
Patches 1 and 3 are independently revertable. Patch 2 v2 is a
two-line readback before an existing check; trivially revertable
in isolation, semantically dependent on the existing
macb_tx_complete_pending() recovery path that it strengthens.
## What I haven't done
* TSO+SG-off canary. rtheobald (cilium#43198 #4188846955)
and the launchpad #2133877 commenter (#34) both report that
turning TSO and SG off *together* masks the stall; my matrix
tested TSO+GSO off, not TSO+SG. Happy to canary-test this on one
node if reviewers want the data point before deciding which
of patches 1/2 the SG path actually exercises.
* Per-patch isolation testing. All three deployed
simultaneously on the 24-node fleet; I cannot independently
prove patch 1 or patch 2 does anything on its own. Patch 3
has direct production evidence (lost-TCOMP recovery
described above). If reviewers want a bisection-style
canary I can stagger one-patch / two-patch / three-patch
nodes for >=1 week each.
## Status and testing
* Mainline-anchored: v2 applies and builds cleanly against
current net-next HEAD. Boot-tested and briefly sanity-checked
in a canary build before this send.
* raspberrypi/linux rpi-6.18.y anchored equivalents: in
production on 24 nodes since 2026-05-02 (now 13 days); in
raspberrypi/linux master since 2026-05-08 (6 days).
* The v2 patch 2 IMR-barrier form was rolled to all 24 Pi
nodes earlier today (2026-05-14, ~14:00 UTC) as a
vendor-fork-anchored update. ~120 cumulative node-hours
of runtime since: zero mid-runtime TX stalls; zero user-space
watchdog RECOVER events. A reply on this cover-letter thread
with the details accompanies this series.
The series does not depend on any other in-flight work. Happy
to split, rebase, drop, or restructure on feedback.
Lukasz Raczylo (3):
net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only)
net: macb: insert PCIe read barrier before TX completion descriptor
check
net: macb: add TX stall watchdog to recover from lost TCOMP interrupts
drivers/net/ethernet/cadence/macb.h | 14 ++++
drivers/net/ethernet/cadence/macb_main.c | 95 ++++++++++++++++++++++++
2 files changed, 109 insertions(+)
--
2.54.0