Frequent spurious tx_timeouts for libertas
dsd at laptop.org
Sun May 1 17:21:58 EDT 2011
The libertas driver works as follows:
Once successfully connected to wifi, it enables the TX queue.
When the network layer calls hard_start_xmit, the driver stores the
packet in memory, stops the TX queue with netif_stop_queue(), and
Some time later, the libertas kernel thread wakes up, sees the new
packet stored for delivery, sends it to the card, then runs
netif_wake_queue() to get things moving again.
Sometimes the TX queue is stopped and started from other places (e.g.
while scanning for networks, which NetworkManager triggers every
couple of minutes)
Thus the TX queue is stopped and started very often.
The libertas tx_timeout handler is being called quite a lot, and quite
soon after boot we get this in the kernel logs (the first time it
[ 97.050101] ------------[ cut here ]------------
[ 97.054866] WARNING: at net/sched/sch_generic.c:258 dev_watchdog+0x102/0x17c
[ 97.062277] Hardware name: XO
[ 97.065335] NETDEV WATCHDOG: eth0 (libertas_sdio): transmit queue 0 timed out
<snip large trace>
This message seems bogus - all the recent transmissions happened correctly.
What is this watchdog trying to detect exactly? Using what criteria?
I had a look at the code and it doesn't quite make sense to me - also
explaining why we get this bogus message. Here's what happens:
1. Kernel asks libertas to transmit a packet; libertas stops the TX
queue and puts the packet in memory
2. Very shortly after, libertas thread picks up the packet, transmits
it via the card, and restarts the tx queue. The packet is transmitted
correctly, and the queue is awake again.
3. Shortly after (with no further TX activity occuring in this time),
NetworkManager starts scanning for networks. This causes the TX queue
to be stopped and started a few times over the space of a few seconds,
and during this time the queue is stopped for quite a while.
4. The netdev watchdog wakes up in one of these times when the TX
queue is stopped, and uses its criteria (last TX was a few seconds
ago, queue is currently stopped) to determine that a transmit timeout
has occurred (???)
Is libertas doing something wrong, or is this a bug/oddity in the network layer?
Also, while looking at this code, I spotted a bug in dev_watchdog():
* old device drivers set dev->trans_start
trans_start = txq->trans_start ? : dev->trans_start;
i.e. it is trying to figure out whether to read trans_start from txq
or dev. In both cases, trans_start is updated based on the value of
jiffies, which will occasionally be 0 (as it wraps around). Therefore
this line of code will occasionally make the wrong decision.
More information about the libertas-dev