Netlink sockets and concurrency
Dan Williams
dcbw at redhat.com
Thu Feb 23 09:46:37 PST 2017
On Thu, 2017-02-23 at 10:38 -0500, Matt Layher wrote:
> Hi all,
>
> This question isn't directly related to libnl, but rather to netlink and
> netlink sockets themselves. I wasn't sure where else to ask, but I
> figured the folks on this list should have some good experience working
> with netlink.
>
> I built a Go package (https://github.com/mdlayher/netlink) for working
> with netlink sockets, but am seeing some occasional strange behavior
> when attempting to use multiple sockets from the same application.
>
> For whatever reason, netlink appears to occasionally send a reply
> message to the wrong socket, when being called concurrently. I'm
> opening 16 genetlink sockets and giving each socket its own "thread"
> ("goroutine" in Go). I pick a sequence number at random for each
> socket, and then increment it each time a message is sent.
Be careful with concurrency, Go, and system calls.
Go's concurrency model is not a strict 1:1 mapping between goroutines
and OS threads. The Go scheduler will often mix and match goroutines
between OS threads on the fly, and you can never guarantee which
goroutine is running on which OS thread, even during the life of the
goroutine.
So don't assume that a goroutine will run on any specific OS thread at
any point. Unless...
You can use runtime.LockOSThread()/runtime.UnlockOSThread() to ensure
that a single goroutine stays on a given OS thread for as long as the
lock is held, and that no other goroutines will run on that OS thread.
This of course reduces parallelism, since the Go scheduler can't run
anything else on that OS thread in the meantime.
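A minimal sketch of that pattern, assuming Linux (it reads the OS thread ID with syscall.Gettid): each goroutine locks itself to its OS thread, yields to the scheduler repeatedly, and checks that its thread ID never changes. The helper name stayedOnOneThread is hypothetical, and the comment marks where a real program would open and use its netlink socket.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// stayedOnOneThread (hypothetical helper) locks the calling goroutine to its
// current OS thread, yields to the scheduler `yields` times, and reports
// whether it observed the same Linux thread ID (TID) after every yield.
func stayedOnOneThread(yields int) bool {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	tid := syscall.Gettid()
	for i := 0; i < yields; i++ {
		// Let other goroutines run; a locked goroutine must resume on the same thread.
		runtime.Gosched()
		if syscall.Gettid() != tid {
			return false
		}
	}
	return true
}

func main() {
	const workers = 16
	results := make(chan bool, workers)
	for i := 0; i < workers; i++ {
		go func() {
			// In a real program, this goroutine would open its netlink socket
			// here and do all of its send/receive calls while locked.
			results <- stayedOnOneThread(1000)
		}()
	}

	ok := true
	for i := 0; i < workers; i++ {
		// Receive every result so no value is skipped by short-circuiting.
		if !<-results {
			ok = false
		}
	}
	fmt.Println("every locked goroutine stayed on one OS thread:", ok)
}
```

Without the LockOSThread call, the same loop may observe different TIDs, which is the kind of migration described above.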
For somewhat related info, see:
https://github.com/containernetworking/cni/tree/master/pkg/ns
I'm not sure why this might cause problems, but you mention threads and
goroutines and that's a trigger :)
---
Anyway, it looks like you're letting the kernel allocate nl_pid. As a
test, what if you create a unique nl_pid for each Conn object before
you bind it, to take the kernel out of the loop for debugging purposes?
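As a rough sketch of that debugging step, assuming Linux and raw syscalls (this is not the API of the netlink package mentioned above): each socket is bound with an explicit nl_pid instead of Pid 0, which asks the kernel to pick one. The helpers portID and openGenetlink, and the layout of process ID in the low 22 bits plus a per-socket counter in the upper bits, are hypothetical choices for illustration.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// portID (hypothetical) builds a per-socket netlink port ID from the process
// ID (low 22 bits) and a socket index n (upper 10 bits, so n must be < 1024).
// n starts at 1 in main because the kernel's automatic binding usually tries
// the bare process ID first.
func portID(pid int, n uint32) uint32 {
	return uint32(pid)&0x3FFFFF | n<<22
}

// openGenetlink (hypothetical) opens a NETLINK_GENERIC socket and binds it to
// the given port ID rather than letting the kernel choose (Pid: 0).
func openGenetlink(port uint32) (int, error) {
	fd, err := syscall.Socket(syscall.AF_NETLINK, syscall.SOCK_RAW|syscall.SOCK_CLOEXEC, syscall.NETLINK_GENERIC)
	if err != nil {
		return -1, err
	}
	sa := &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK, Pid: port}
	if err := syscall.Bind(fd, sa); err != nil {
		syscall.Close(fd)
		return -1, err
	}
	return fd, nil
}

func main() {
	pid := os.Getpid()
	for n := uint32(1); n <= 16; n++ {
		port := portID(pid, n)
		fd, err := openGenetlink(port)
		if err != nil {
			fmt.Println("bind failed for port", port, ":", err)
			continue
		}
		fmt.Printf("socket %d bound to nl_pid %d\n", n, port)
		syscall.Close(fd)
	}
}
```

If replies still arrive with another socket's sequence number when every nl_pid is known in advance, kernel port allocation can be ruled out as the cause.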
Dan
> At this point, I send 10,000 messages from each socket with the flags
> "request + acknowledge", so netlink will echo back the message I sent
> to it. Again, before each message is sent, I increment the internal
> sequence number of my socket wrapper.
>
> For whatever reason, sometimes I receive a reply back from netlink with
> an unexpected sequence number. The sequence number often looks like it
> was meant for another socket in the test, running in a different
> thread.
>
> Is it safe to open multiple sockets to netlink (genetlink, specifically)
> in the same application and use them concurrently in this way? As far as
> I can tell, my code is free of race conditions in user-space (verified
> using Go's race detector). I am not sharing a single socket between
> multiple threads. I am simply sending and receiving on multiple sockets
> at the same time, in independent threads.
>
> It doesn't appear that libnl has any special "global lock", other than
> the PID assignment map. I am no C expert, but I'm curious if there is a
> workaround in libnl for making use of multiple sockets concurrently, to
> ensure that messages are delivered properly to the expected socket.
>
> Thanks for your time. I'd certainly appreciate any insight you all may
> have on this matter.
>
> - Matt Layher
>
> _______________________________________________
> libnl mailing list
> libnl at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/libnl