[LEDE-DEV] [BUG][NET] swconfig.c sleeping with spinlock held
Andrea Merello
andrea.merello at gmail.com
Sat Jul 8 06:49:34 PDT 2017
For reasons that are off-topic here, I'm compiling the LEDE kernel
(4.9.31) with several debug checks enabled, and I'm running it on a
Lantiq xrx200 board (FRITZ!Box 3370).
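For reference, the debug options relevant to this report are, I
believe, roughly these (the first one produces the "sleeping function
called from invalid context" check, lockdep prints the "locks held"
list in the splat below):

CONFIG_DEBUG_ATOMIC_SLEEP=y   # catches sleeping calls in atomic context
CONFIG_PROVE_LOCKING=y        # lockdep; prints the "N locks held by ..." list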
I've hit a bug in the net/phy/swconfig.c driver: in swconfig_get_dev()
we attempt to take dev->sw_mutex while holding a spinlock (the one
taken in the swconfig_lock() function). AFAIK sleeping with a spinlock
held can deadlock the system, and it does make the kernel complain [0].
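Stripped down to the locking, the problematic path looks roughly like
this (paraphrased from memory, not the literal code; the
swconfig_dev_matches() helper is hypothetical and stands in for the
actual lookup):

static LIST_HEAD(swdevs);
static DEFINE_SPINLOCK(swdevs_lock);

static struct switch_dev *
swconfig_get_dev(struct genl_info *info)
{
	struct switch_dev *dev = NULL, *p;

	swconfig_lock();			/* spin_lock(&swdevs_lock) */
	list_for_each_entry(p, &swdevs, dev_list) {
		if (swconfig_dev_matches(p, info)) {	/* hypothetical matcher */
			dev = p;
			break;
		}
	}
	if (dev)
		mutex_lock(&dev->sw_mutex);	/* may sleep -> BUG under a spinlock */
	swconfig_unlock();			/* spin_unlock(&swdevs_lock) */
	return dev;
}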
I'm not familiar with this driver code, nor with generic netlink,
but I have noticed that:
1- there is a "parallel_ops" flag in struct genl_family (see the first
sketch after this list). I'm not sure, but unless a family sets that
flag, the generic netlink core seems to serialize all calls to the
genl_ops callbacks itself; is in-driver locking really required then?
2- looking at the net core code that implements the lock above [1], it
seems to use a regular mutex (not a spinlock), so I guess the genl_ops
callbacks are not called in atomic context. Why do we need a spinlock
in the swconfig.c driver? I couldn't spot any interrupt handler or
tasklet in the driver code that could race with the genl_ops (and in
those cases I would expect spin_lock_irqsave() or spin_lock_bh() to be
used in the genl_ops functions); also, after an overview of the
external users of the exported functions (a bunch of PHY drivers), I
couldn't find any usage that justifies a spinlock. Am I missing
something?
3- the two locks above are also taken in the reverse order in
unregister_switch(). Couldn't the inverted acquisition order cause
another (AB/BA) deadlock? See the second sketch after this list.
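To illustrate point 1, this is the flag I mean; the field names exist
in include/net/genetlink.h of this kernel, but the swconfig values
below are from memory and may be off:

static struct genl_family switch_fam = {
	.id = GENL_ID_GENERATE,
	.name = "switch",
	.hdrsize = 0,
	.version = 1,
	.maxattr = SWITCH_ATTR_MAX,
	/* .parallel_ops is left false (the default): the genetlink core
	 * then takes genl_mutex around every doit()/dumpit() call, so
	 * callbacks of this family never run concurrently with each other. */
};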
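And to make point 3 concrete, this is the inversion I mean (simplified,
order as I read it):

/*
 * swconfig_get_dev() (genl path):	unregister_switch():
 *	spin_lock(&swdevs_lock);		mutex_lock(&dev->sw_mutex);
 *	mutex_lock(&dev->sw_mutex);		spin_lock(&swdevs_lock);
 *
 * If two tasks hit these paths concurrently, each can end up holding
 * the lock the other one is waiting for (classic AB/BA deadlock).
 */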
If you could give me some clues about how locking is designed in the
swconfig driver (i.e. what each of the two locks is meant to protect),
then I could try to fix the issue myself and post a patch.
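For instance, assuming the answer to point 2 is that no atomic context
is involved, the naive fix I'd try first is simply turning the spinlock
into a mutex (an untested sketch, not a formal patch):

-static DEFINE_SPINLOCK(swdevs_lock);
+static DEFINE_MUTEX(swdevs_lock);

 static inline void swconfig_lock(void)
 {
-	spin_lock(&swdevs_lock);
+	mutex_lock(&swdevs_lock);
 }

 static inline void swconfig_unlock(void)
 {
-	spin_unlock(&swdevs_lock);
+	mutex_unlock(&swdevs_lock);
 }

That alone wouldn't address the ordering inversion of point 3, though;
unregister_switch() would presumably also need its acquisitions
re-ordered.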
Thanks
Andrea
[0]
[ 11.671567] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[ 11.678696] in_atomic(): 1, irqs_disabled(): 0, pid: 711, name: swconfig
[ 11.685376] 3 locks held by swconfig/711:
[ 11.689384] #0: (cb_lock){......}, at: [<80453a34>] genl_rcv+0x20/0x48
[ 11.696054] #1: (genl_mutex){......}, at: [<80453aac>] genl_rcv_msg+0x50/0x364
[ 11.703444] #2: (swdevs_lock){......}, at: [<8039e48c>] swconfig_get_dev.isra.6+0x2c/0xd0
[ 11.711816] CPU: 1 PID: 711 Comm: swconfig Not tainted 4.9.31 #0
[ 11.717739] Stack : 00000000 00000000 80e6c56a 00000034 80573bb4 00000000 00000000 807a0000
[ 11.726084]         87e1a7ec 80795887 806f38bc 00000001 000002c7 809041fc 00000000 00000000
[ 11.734439]         00000000 8007da38 00000000 00000000 00000000 8007da38 806faf04 874d1aec
[ 11.742795]         00000002 800be57c 806f97e4 874d1afc 00000000 80790000 00000000 874d1a00
[ 11.751151]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 11.759507] ...
[ 11.761946] Call Trace:
[ 11.764400] [<80010b38>] show_stack+0x50/0x84
[ 11.768788] [<802c0374>] dump_stack+0xd4/0x110
[ 11.773200] [<800569b0>] ___might_sleep+0xfc/0x11c
[ 11.778011] [<80556058>] mutex_lock_nested+0x4c/0x454
[ 11.783050] [<8039e4d0>] swconfig_get_dev.isra.6+0x70/0xd0
[ 11.788520] [<8039e56c>] swconfig_list_attrs+0x3c/0x230
[ 11.793751] [<80453d10>] genl_rcv_msg+0x2b4/0x364
[ 11.798449] [<80452db4>] netlink_rcv_skb+0x7c/0xf8
[ 11.803228] [<80453a44>] genl_rcv+0x30/0x48
[ 11.807415] [<804524d0>] netlink_unicast+0x190/0x2c4
[ 11.812367] [<80452b54>] netlink_sendmsg+0x418/0x44c
[ 11.817356] [<803ff4ac>] sock_sendmsg+0x18/0x30
[ 11.821864] [<80400948>] ___sys_sendmsg+0x1bc/0x25c
[ 11.826731] [<804018a8>] __sys_sendmsg+0x4c/0x80
[ 11.831353] [<8001ad38>] syscall_common+0x34/0x58
[1]
http://elixir.free-electrons.com/linux/latest/source/net/netlink/genetlink.c#L30
http://elixir.free-electrons.com/linux/latest/source/net/netlink/genetlink.c#L610