[PATCH v3 0/2] * drivers: net: sun4i-emac: Fix emac_timeout *

Thu Apr 27 03:52:29 PDT 2023

From: qianfan Zhao <qianfanguijin at 163.com>

History:

2022-09-12:
Posted the first patch, which can be read at:
https://lkml.kernel.org/lkml/20220912063331.23369-1-qianfanguijin@163.com/
It was reviewed by Jernej Skrabec <jernej.skrabec at gmail.com> but has not been
merged.

2023-04-27:

After applying the first patch I found that the bug was not fully fixed.
I still get these error messages sometimes:

[  108.581230] spi_master spi2: spi2.1: timeout transferring 1025 bytes at 100000Hz for 190(164)ms
[  108.590337] spidev spi2.1: SPI transfer failed: -110
[  108.595443] spi_master spi2: failed to transfer one message from queue
...

I tried the `kdump` and `crash` tools but found nothing useful.

A few days later I found that `softirq` takes about 100% of one CPU core.
Tracing the softirq_entry, softirq_exit and net_dev_xmit events gave a flood
of messages like these:

289.902631: softirq_entry: vec=2 [action=NET_TX]
289.902651: net_dev_xmit: dev=eth0 skbaddr=(ptrval) len=98 rc=16
289.902656: softirq_exit: vec=2 [action=NET_TX]
289.902659: softirq_entry: vec=2 [action=NET_TX]
289.902664: net_dev_xmit: dev=eth0 skbaddr=(ptrval) len=98 rc=16
289.902668: softirq_exit: vec=2 [action=NET_TX]
...

I then debugged the Linux kernel under qemu, making the emac model in qemu
drop some tx packets this way:

diff --git a/hw/net/allwinner_emac.c b/hw/net/allwinner_emac.c
index 372e5b66da..28dfb1116b 100644
--- a/hw/net/allwinner_emac.c
+++ b/hw/net/allwinner_emac.c
@@ -349,9 +349,14 @@ static void aw_emac_write(void *opaque, hwaddr offset, uint64_t value,
                               "allwinner_emac: TX length > fifo data length\n");
             }
             if (len > 0) {
+                int ignore = random() % 10 < 1;
                 data = fifo8_pop_buf(fifo, len, &ret);
-                qemu_send_packet(nc, data, ret);
+                if (!ignore)
+                    qemu_send_packet(nc, data, ret);
                 aw_emac_tx_reset(s, chan);
+
+                if (ignore)
+                    break;
                 /* Raise TX interrupt */
                 s->int_sta |= EMAC_INT_TX_CHAN(chan);
                 aw_emac_update_irq(s);

It's very easy to reproduce this bug now.

Here is the gdb backtrace taken when the softirq was raised again:

#0  __raise_softirq_irqoff (nr=nr@entry=2) at kernel/softirq.c:699
#1  raise_softirq_irqoff (nr=nr@entry=2) at kernel/softirq.c:671
#2  0xc0855a34 in __netif_reschedule (q=0xc2027c00) at net/core/dev.c:3041
#3  __netif_schedule (q=q@entry=0xc2027c00) at net/core/dev.c:3048
#4  0xc085b0ec in qdisc_run_end (qdisc=0xc2027c00) at ./include/net/sch_generic.h:227
#5  qdisc_run (q=0xc2027c00) at ./include/net/pkt_sched.h:133
#6  net_tx_action (h=<optimized out>) at net/core/dev.c:5046
#7  0xc0101298 in __do_softirq () at kernel/softirq.c:558
#8  0xc0127cd0 in run_ksoftirqd (cpu=<optimized out>) at kernel/softirq.c:920
#9  0xc01487d0 in smpboot_thread_fn (data=0xc14a2780) at kernel/smpboot.c:164
#10 0xc0144b58 in kthread (_create=0xc14a2800) at kernel/kthread.c:319
#11 0xc0100130 in ret_from_fork () at arch/arm/kernel/entry-common.S:146
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

`net_tx_action` runs in `__do_softirq` and sends packets via `qdisc_run`.
But the emac driver in Linux always returns NETDEV_TX_BUSY(16) after
emac_timeout because we forget to reset `db->tx_fifo_stat`, which makes
`__netif_schedule` raise the softirq again and again.

qianfan Zhao (2):
  drivers: net: sun4i-emac: Fix double spinlock in emac_timeout
  drivers: net: sun4i-emac: Fix emac_timeout

 drivers/net/ethernet/allwinner/sun4i-emac.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

--
2.25.1



