[PATCH net-next v2 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1

Lukasz Raczylo lukasz at raczylo.com
Thu May 14 14:54:56 PDT 2026


Hi netdev, Théo, Andrea, linux-rpi,

v2 of the silent TX stall series.  The v1 RFC sits at:

  https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/

Reframing first.  The v1 cover claimed "zero events post-patch";
that was true at the user-space watchdog visibility level only.
A dmesg sweep prompted by Andrea's review -- with patch 3's warn
made unconditional, per his ask -- revealed kernel-level evidence
that patches 1 and 2 are partial at best.  Patch 3 is empirically
the load-bearing fix on this platform: it caught and recovered a
real lost-TCOMP stall on pi-data-02 at 2026-05-05T13:24:09Z
(queue 0, tail=259564431 head=259564433 after ~260M TX, HW
ETHS tx_frames counter advancing through the event while driver
tx_tail did not) without user-space involvement.

So the v2 narrative reads:

  * Patch 1 (PCIe posted-write flush) and patch 2 (PCIe read
    barrier before descriptor check) close two specific
    candidate races in the TSTART / TX_USED paths.  Plausible
    and well-motivated, but I cannot prove either fires in
    isolation on this hardware -- my 1 Hz trace shows TX
    freezes, not which mechanism caused them.

  * Patch 3 (TX stall watchdog) is the safety net that
    empirically does the recovery work.  It has 13 days of
    production runtime on 24 nodes since 2026-05-02 in the same
    form (anchored against the rpi-6.18.y vendor fork, in
    raspberrypi/linux#7340 -- merged 2026-05-08 after review
    feedback from pelwell that this v2 incorporates).

The v1 cover's "zero stalls in 95 node-hours of post-patch
uptime" framing was misleading.  Apologies for that.

## What changed in v2

Patch 1 (PCIe posted-write flush after TSTART doorbell):
  * Gates the readback behind a new MACB_CAPS_PCIE_POSTED_WRITES
    capability, set only on raspberrypi_rp1_config.  v1
    applied the readback to every macb variant; SoC-integrated
    parts (Atmel, Microchip, SiFive, Xilinx) have no posted-write
    fabric and were paying the readback latency for no benefit.
  * Commit message notes that the readback also flushes the
    preceding macb_tx_lpi_wake() NCR write on the same path --
    not just TSTART -- since it functions as a PCIe read barrier
    for all prior posted writes by the same requester.
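
To illustrate the gating described above, here is a toy userspace
model, not the patch itself.  Only the MACB_CAPS_PCIE_POSTED_WRITES
name comes from the series; its bit value, the fake_* helpers, the
struct layout and kick_tx() are hypothetical stand-ins.  The model
assumes a posted write stays buffered until a read from the same
requester forces it out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Name from the series; the bit value here is an assumption. */
#define MACB_CAPS_PCIE_POSTED_WRITES (1u << 0)
/* Illustrative doorbell bit for this model only. */
#define FAKE_TSTART (1u << 9)

/* Hypothetical device model: one register, optionally behind a
 * fabric that buffers (posts) writes. */
struct fake_dev {
	uint32_t caps;          /* driver capability flags */
	bool     posted_fabric; /* true: writes are buffered */
	uint32_t ncr;           /* value the device has actually seen */
	uint32_t pending;       /* write still in flight */
	bool     has_pending;
	unsigned reads;         /* number of (slow) register reads issued */
};

void fake_writel(struct fake_dev *d, uint32_t v)
{
	if (d->posted_fabric) {
		d->pending = v;
		d->has_pending = true;
	} else {
		d->ncr = v;
	}
}

uint32_t fake_readl(struct fake_dev *d)
{
	/* A read completion cannot pass earlier posted writes, so
	 * any buffered write lands before the read returns. */
	if (d->has_pending) {
		d->ncr = d->pending;
		d->has_pending = false;
	}
	d->reads++;
	return d->ncr;
}

/* Ring the doorbell; pay for the readback only where flagged. */
void kick_tx(struct fake_dev *d, uint32_t ncr_shadow)
{
	fake_writel(d, ncr_shadow | FAKE_TSTART);
	if (d->caps & MACB_CAPS_PCIE_POSTED_WRITES)
		(void)fake_readl(d);
}

/* Small self-check of the two configurations. */
void demo_posted_write(void)
{
	struct fake_dev rp1 = { .caps = MACB_CAPS_PCIE_POSTED_WRITES,
				.posted_fabric = true };
	struct fake_dev soc = { .caps = 0, .posted_fabric = false };

	kick_tx(&rp1, 0);
	kick_tx(&soc, 0);
	assert((rp1.ncr & FAKE_TSTART) && rp1.reads == 1);
	assert((soc.ncr & FAKE_TSTART) && soc.reads == 0);
}
```

In this model the flagged device sees the doorbell after exactly one
extra read, and the unflagged SoC-style device issues no read at all,
which is the latency saving the capability gate is meant to preserve.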

Patch 2 (PCIe read barrier before TX completion descriptor check):
  * Dropped the ISR read.  v1 read ISR in macb_tx_poll() with
    `queue_readl(queue, ISR) & MACB_BIT(TCOMP)`; that's
    destructive on RP1 silicon (MACB_CAPS_ISR_CLEAR_ON_WRITE
    is not set on raspberrypi_rp1_config; the existing handler
    assumes read-clear semantics and processes every bit
    returned from queue_readl(queue, ISR) in one pass).  v1's
    masked-and-discarded read silently consumed any other bit
    set in ISR at that instant -- RCOMP being the worst case
    (RX completion never scheduled until the line re-asserts).
  * v2 substitutes `(void)queue_readl(queue, IMR)` -- IMR is
    the read-only interrupt mask mirror, no side effects, still
    flushes prior peripheral DMA writes via PCIe completion
    ordering.  Loses the "directly sample latched TCOMP" half
    of v1's claim; keeps the PCIe-barrier half, which is the
    half that addresses the documented race in the existing
    macb_tx_complete_pending() rmb() comment.
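
The difference between the v1 and v2 reads can be shown with a small
userspace model of a read-to-clear status register.  This is a sketch
under assumptions: the struct, the read_isr()/read_imr() helpers and
the bit positions are illustrative, and no PCIe ordering is modelled,
only the register side effect.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions for this model only. */
#define BIT_RCOMP (1u << 1)
#define BIT_TCOMP (1u << 7)

/* Hypothetical per-queue register pair. */
struct fake_queue {
	uint32_t isr; /* read-to-clear interrupt status */
	uint32_t imr; /* read-only interrupt mask mirror */
};

/* Read-to-clear: returning the value also wipes it. */
uint32_t read_isr(struct fake_queue *q)
{
	uint32_t v = q->isr;

	q->isr = 0;
	return v;
}

/* No side effects. */
uint32_t read_imr(const struct fake_queue *q)
{
	return q->imr;
}

/* v1-style poll: looks only at TCOMP, discards everything else. */
int v1_tcomp_pending(struct fake_queue *q)
{
	return (read_isr(q) & BIT_TCOMP) != 0;
}

/* v2-style poll: the read is used purely as a barrier. */
void v2_barrier(const struct fake_queue *q)
{
	(void)read_imr(q);
}

/* Handler that processes every bit returned by a single ISR read. */
uint32_t irq_handler(struct fake_queue *q)
{
	return read_isr(q);
}

/* Small self-check contrasting the two forms. */
void demo_isr_vs_imr(void)
{
	struct fake_queue a = { .isr = BIT_RCOMP | BIT_TCOMP };
	struct fake_queue b = { .isr = BIT_RCOMP | BIT_TCOMP };

	(void)v1_tcomp_pending(&a);
	assert(irq_handler(&a) == 0);          /* RCOMP was consumed */

	v2_barrier(&b);
	assert(irq_handler(&b) == (BIT_RCOMP | BIT_TCOMP));
}
```

With both bits latched, the v1-style poll leaves nothing for the
handler to see, while the IMR read leaves ISR untouched.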

Patch 3 (TX stall watchdog):
  * Tail movement is tracked via a `bool tx_stall_tail_moved`
    set by macb_tx_complete() under tx_ptr_lock when tail
    advances, and cleared by the watchdog tick on the same lock.
    v1 snapshotted tx_tail and compared between ticks; while
    that worked correctly given tx_tail is free-running u32,
    the bool form is unambiguously cleaner, doesn't depend on
    counter behaviour, and is what pelwell asked for when he
    reviewed the same series on the rpi side
    (raspberrypi/linux#7340).
  * netif_carrier_ok() gate added at the top of the watchdog
    tick.  Eliminates the boot-time false positive seen in v1
    where, between macb_open() and link-autoneg-completion,
    queue->tx_head can advance from kernel-queued packets while
    tx_tail stays at 0 (no TCOMPs yet), tripping the snapshot
    check.  Observed 6 such fires during a 2026-05-02 fleet
    rolling reboot.
  * netdev_warn_once -> netdev_warn_ratelimited.  v1's
    netdev_warn_once made operational accounting impossible
    after the first fire on a given netdev; ratelimited keeps
    bounded log noise but lets operators count events.  Andrea
    asked for this directly.
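
The watchdog decision described in the three points above can be
sketched in userspace as follows.  This is a simplified model, not
the patch: tx_stall_tail_moved, tx_head and tx_tail are names from
the series, while the struct, the function names, the single-tick
threshold and the recovery counter are assumptions.  Locking
(tx_ptr_lock in the driver) and the actual recovery action are
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue state for the model. */
struct fake_txq {
	uint32_t tx_head;             /* free-running producer index */
	uint32_t tx_tail;             /* free-running consumer index */
	bool     tx_stall_tail_moved; /* set on completion, cleared by tick */
	bool     carrier_ok;          /* stand-in for netif_carrier_ok() */
	unsigned recoveries;          /* how often the watchdog fired */
};

/* Completion path: advance tail and note that progress was made. */
void tx_complete(struct fake_txq *q, uint32_t n)
{
	if (n) {
		q->tx_tail += n;
		q->tx_stall_tail_moved = true;
	}
}

/* One watchdog tick; returns true when a stall is declared. */
bool watchdog_tick(struct fake_txq *q)
{
	bool moved;

	/* No link: queued packets cannot complete, so do not judge. */
	if (!q->carrier_ok)
		return false;

	moved = q->tx_stall_tail_moved;
	q->tx_stall_tail_moved = false;

	/* Progress since the last tick, or nothing outstanding. */
	if (moved || q->tx_head == q->tx_tail)
		return false;

	q->recoveries++; /* the real driver would warn and recover here */
	return true;
}

/* Small self-check using the indices quoted earlier in this mail. */
void demo_watchdog(void)
{
	struct fake_txq q = { .tx_head = 259564433u,
			      .tx_tail = 259564430u,
			      .carrier_ok = true };

	tx_complete(&q, 1);             /* tail reaches 259564431 */
	assert(!watchdog_tick(&q));     /* progress seen, flag cleared */
	assert(watchdog_tick(&q));      /* no progress, 2 outstanding */
	assert(q.recoveries == 1);
}
```

Because the tick only consumes a flag, the decision does not depend
on comparing index snapshots, and a queue with the carrier down never
reaches the stall check at all.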

Patches 1 and 3 are independently revertable.  Patch 2 v2 is a
two-line readback before an existing check; trivially revertable
in isolation, semantically dependent on the existing
macb_tx_complete_pending() recovery path that it strengthens.

## What I haven't done

  * TSO+SG-off canary.  rtheobald (cilium#43198 #4188846955)
    and the launchpad #2133877 commenter (#34) both report that
    TSO and SG off *together* mask the stall; my matrix has TSO+GSO
    tested off, not TSO+SG.  Happy to canary-test this on one
    node if reviewers want the data point before deciding which
    of patches 1/2 the SG path actually exercises.

  * Per-patch isolation testing.  All three deployed
    simultaneously on the 24-node fleet; I cannot independently
    prove patch 1 or patch 2 does anything on its own.  Patch 3
    has direct production evidence (lost-TCOMP recovery
    described above).  If reviewers want a bisection-style
    canary I can stagger one-patch / two-patch / three-patch
    nodes for >=1 week each.

## Status and testing

  * Mainline-anchored:  v2 applies cleanly to current net-next
    HEAD and builds clean.  Boot-tested and briefly sanity-checked
    in a canary build before this send.
  * raspberrypi/linux rpi-6.18.y anchored equivalents:  in
    production on 24 nodes since 2026-05-02 (now 13 days); in
    raspberrypi/linux master since 2026-05-08 (6 days).
  * The v2 patch 2 IMR-barrier form was rolled to all 24 Pi
    nodes earlier today (2026-05-14, ~14:00 UTC) as a
    vendor-fork-anchored update.  ~120 cumulative node-hours
    of runtime since: zero mid-runtime TX stalls; zero user-space
    watchdog RECOVER events.  A reply on the cover-letter thread
    with details accompanies this series.

The series does not depend on any other in-flight work.  Happy
to split, rebase, drop, or restructure on feedback.

Lukasz Raczylo (3):
  net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only)
  net: macb: insert PCIe read barrier before TX completion descriptor
    check
  net: macb: add TX stall watchdog to recover from lost TCOMP interrupts

 drivers/net/ethernet/cadence/macb.h      | 14 ++++
 drivers/net/ethernet/cadence/macb_main.c | 95 ++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

-- 
2.54.0
