[LEDE-DEV] [BUG][NET] swconfig.c sleeping with spinlock held

Andrea Merello andrea.merello at gmail.com
Sat Jul 8 06:49:34 PDT 2017


For some off-topic reasons I'm compiling the LEDE kernel (4.9.31) with
several debug checks enabled, and I'm running it on a Lantiq xrx200
board (Fritzbox 3370).

I've hit a bug in the net/phy/swconfig.c driver: in swconfig_get_dev()
we attempt to take dev->sw_mutex while holding a spinlock (taken in the
swconfig_lock() function). AFAIK sleeping while holding a spinlock can
cause deadlocks, and it does cause the kernel to complain [0].
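
To illustrate, here is a simplified sketch of the offending pattern as
I read it from the code and from the trace below (not the literal
driver source):

/* swdevs_lock is a spinlock, so we enter atomic context when we take
 * it; mutex_lock() may sleep, which is illegal in atomic context. */
static DEFINE_SPINLOCK(swdevs_lock);

static struct switch_dev *swconfig_get_dev(struct genl_info *info)
{
	struct switch_dev *dev = NULL;

	spin_lock(&swdevs_lock);	/* swconfig_lock(): atomic from here */
	/* ... walk the swdevs list, set dev on a match ... */
	if (dev)
		mutex_lock(&dev->sw_mutex);	/* BUG: may sleep while atomic */
	spin_unlock(&swdevs_lock);
	return dev;
}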

I'm not familiar with this driver's code, nor with generic netlink;
however, I have noticed the following:

1- there is a "parallel_ops" flag in struct genl_family. I'm not sure,
but it looks like that, when this flag is left unset, the netlink core
itself already serializes the calls to the genl_ops callbacks; if
that's right, is in-driver locking really required at all? (See the
first sketch after this list.)

2- looking at the net core code that implements the serialization
above [1], it uses a regular mutex (not a spinlock), so I guess the
genl_ops callbacks are not called in atomic context. Why do we need a
spinlock in the swconfig.c driver at all? I couldn't spot any
interrupt handler or tasklet in the driver code that could be racy wrt
the genl_ops (and in that case I would expect spin_lock_irqsave() or
spin_lock_bh() to be used in the genl_ops funcs); also, after an
overview of the external users of the exported functions (a bunch of
PHY drivers), I couldn't find any usage that justifies a spinlock. Am
I missing something? (See the second sketch after this list.)

3- the two locks mentioned above are also taken in the reverse order
in unregister_switch(). Couldn't the reversed order cause another
(AB-BA) deadlock? (See the third sketch after this list.)
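
Regarding point 1, I mean something like the family definition below
(struct genl_family and the parallel_ops field are from the mainline
kernel; whether swconfig's actual definition looks exactly like this
I'm not sure):

/* With .parallel_ops left false (the default), genl_rcv_msg() takes
 * genl_mutex around every doit() callback, so the ops are already
 * serialized against each other by the netlink core. Setting
 * .parallel_ops = true would opt out of that serialization. */
static struct genl_family switch_fam = {
	.id = GENL_ID_GENERATE,
	.name = "switch",
	.hdrsize = 0,
	.version = 1,
	.maxattr = SWITCH_ATTR_MAX,
	/* .parallel_ops = false, implicitly */
};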
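
Regarding point 2, if nothing really takes swdevs_lock from atomic
context, a possible fix (an untested sketch of what I would try, not a
finished patch) is to turn it into a mutex, so that nesting
dev->sw_mutex inside it becomes legal:

/* Hypothetical change: swdevs_lock becomes a mutex. This is only
 * correct if every user of swconfig_lock()/swconfig_unlock() runs in
 * sleepable context, which is what I'm trying to confirm. */
static DEFINE_MUTEX(swdevs_lock);

static inline void swconfig_lock(void)
{
	mutex_lock(&swdevs_lock);
}

static inline void swconfig_unlock(void)
{
	mutex_unlock(&swdevs_lock);
}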
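
Regarding point 3, the AB-BA scenario I'm worried about would look
like this, assuming a genl op and unregister_switch() run concurrently:

  CPU 0 (genl op)                    CPU 1 (unregister_switch)
  take swdevs_lock                   take dev->sw_mutex
  take dev->sw_mutex  -> blocks      take swdevs_lock  -> blocks

Each CPU waits for the lock the other one holds, so neither can make
progress.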

If you could give me some clues about how locking is designed in the
swconfig driver (i.e. what the two locks are for), then I could try to
fix the issue myself and post a patch.

Thanks
Andrea

[0]
[   11.671567] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[   11.678696] in_atomic(): 1, irqs_disabled(): 0, pid: 711, name: swconfig
[   11.685376] 3 locks held by swconfig/711:
[   11.689384]  #0:  (cb_lock){......}, at: [<80453a34>] genl_rcv+0x20/0x48
[   11.696054]  #1:  (genl_mutex){......}, at: [<80453aac>] genl_rcv_msg+0x50/0x364
[   11.703444]  #2:  (swdevs_lock){......}, at: [<8039e48c>] swconfig_get_dev.isra.6+0x2c/0xd0
[   11.711816] CPU: 1 PID: 711 Comm: swconfig Not tainted 4.9.31 #0
[   11.717739] Stack : 00000000 00000000 80e6c56a 00000034 80573bb4 00000000 00000000 807a0000
[   11.726084]         87e1a7ec 80795887 806f38bc 00000001 000002c7 809041fc 00000000 00000000
[   11.734439]         00000000 8007da38 00000000 00000000 00000000 8007da38 806faf04 874d1aec
[   11.742795]         00000002 800be57c 806f97e4 874d1afc 00000000 80790000 00000000 874d1a00
[   11.751151]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   11.759507]         ...
[   11.761946] Call Trace:
[   11.764400] [<80010b38>] show_stack+0x50/0x84
[   11.768788] [<802c0374>] dump_stack+0xd4/0x110
[   11.773200] [<800569b0>] ___might_sleep+0xfc/0x11c
[   11.778011] [<80556058>] mutex_lock_nested+0x4c/0x454
[   11.783050] [<8039e4d0>] swconfig_get_dev.isra.6+0x70/0xd0
[   11.788520] [<8039e56c>] swconfig_list_attrs+0x3c/0x230
[   11.793751] [<80453d10>] genl_rcv_msg+0x2b4/0x364
[   11.798449] [<80452db4>] netlink_rcv_skb+0x7c/0xf8
[   11.803228] [<80453a44>] genl_rcv+0x30/0x48
[   11.807415] [<804524d0>] netlink_unicast+0x190/0x2c4
[   11.812367] [<80452b54>] netlink_sendmsg+0x418/0x44c
[   11.817356] [<803ff4ac>] sock_sendmsg+0x18/0x30
[   11.821864] [<80400948>] ___sys_sendmsg+0x1bc/0x25c
[   11.826731] [<804018a8>] __sys_sendmsg+0x4c/0x80
[   11.831353] [<8001ad38>] syscall_common+0x34/0x58

[1]
http://elixir.free-electrons.com/linux/latest/source/net/netlink/genetlink.c#L30
http://elixir.free-electrons.com/linux/latest/source/net/netlink/genetlink.c#L610


