Netlink sockets and concurrency
Matt Layher
mdlayher at gmail.com
Thu Feb 23 11:43:42 PST 2017
Good news! I got a new error!
I ended up locking each goroutine to its OS thread _before_ dialing out
to netlink. For some reason, it hadn't occurred to me that I should
do this before dialing, and I was only locking before the send/receive
calls. Since making this change, I haven't seen it fail with a
mismatched sequence number.
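
A minimal sketch of the ordering described above, for anyone who wants to
try the same thing. The conn type and the dial function are hypothetical
placeholders (they do no real netlink I/O and are not the API of my
package); only the runtime.LockOSThread and runtime.UnlockOSThread calls
are real. The point is just that the lock is taken before the socket is
opened and held until the goroutine is done with it:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// conn is a hypothetical stand-in for a netlink connection type.
type conn struct {
	id int
}

// dial is a hypothetical placeholder for the code that opens and binds a
// netlink socket. It does no real I/O here.
func dial(id int) (*conn, error) {
	return &conn{id: id}, nil
}

// runLocked starts one goroutine per worker. Each goroutine locks itself to
// its OS thread before dialing and stays locked until work returns.
func runLocked(workers int, work func(c *conn) error) []error {
	errs := make([]error, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()

			// Lock first, so that the dial and every later send/receive
			// happen on the same OS thread.
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()

			c, err := dial(i)
			if err != nil {
				errs[i] = err
				return
			}
			errs[i] = work(c)
		}(i)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := runLocked(16, func(c *conn) error {
		// Send and receive on c here.
		return nil
	})
	for i, err := range errs {
		if err != nil {
			fmt.Printf("worker %d: %v\n", i, err)
		}
	}
	fmt.Println("workers finished:", len(errs))
}
```

Each worker writes only to its own slot in the result slice, so the sketch
needs no extra locking around the errors.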
I did try assigning a PID myself during bind() (thanks for the pointer,
it didn't even occur to me to try that before), but it made no
difference. It appears that locking OS threads is the correct
solution, so I've reverted that change and am letting netlink assign
the PIDs again.
Now I'm occasionally getting EINVAL from netlink on sendto(), but not
consistently. Here are some example byte slices that produce the error:
[]byte{0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 0x0, 0xb9, 0xf, 0x0, 0x0,
0x4c, 0xc9, 0xfc, 0xff}
[]byte{0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 0x0, 0x19, 0xf, 0x0, 0x0,
0x80, 0xc7, 0xfc, 0xff}
[]byte{0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 0x0, 0x56, 0x6, 0x0, 0x0,
0xa6, 0xc6, 0xfc, 0xff}
[]byte{0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 0x0, 0x17, 0x1, 0x0, 0x0,
0x47, 0x8f, 0xfc, 0xff}
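
To make the slices above easier to read, here is a small sketch that
decodes the 16-byte netlink message header (length, type, flags, sequence,
PID). The header type and the parseHeader function are names made up for
this example, and the sketch assumes a little-endian host, since netlink
headers are in host byte order:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// header mirrors the fields of struct nlmsghdr. The type name is
// hypothetical and exists only for this example.
type header struct {
	Length   uint32
	Type     uint16
	Flags    uint16
	Sequence uint32
	PID      uint32
}

// parseHeader decodes the first 16 bytes of b as a netlink message header.
// Assumption: the bytes were produced on a little-endian host.
func parseHeader(b []byte) (header, error) {
	if len(b) < 16 {
		return header{}, errors.New("netlink header requires 16 bytes")
	}
	return header{
		Length:   binary.LittleEndian.Uint32(b[0:4]),
		Type:     binary.LittleEndian.Uint16(b[4:6]),
		Flags:    binary.LittleEndian.Uint16(b[6:8]),
		Sequence: binary.LittleEndian.Uint32(b[8:12]),
		PID:      binary.LittleEndian.Uint32(b[12:16]),
	}, nil
}

func main() {
	// First example slice from this message.
	msg := []byte{0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 0x0, 0xb9, 0xf, 0x0, 0x0,
		0x4c, 0xc9, 0xfc, 0xff}

	h, err := parseHeader(msg)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("length=%d type=%d flags=%#x sequence=%d pid=%d (as int32: %d)\n",
		h.Length, h.Type, h.Flags, h.Sequence, h.PID, int32(h.PID))
}
```

Decoded this way, the first slice is a 16-byte message (header only, no
payload) with type 0, flags 0x5 (NLM_F_REQUEST 0x1 plus NLM_F_ACK 0x4) and
sequence 4025. The other three differ only in sequence and PID.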
As far as I can tell, these are fairly normal netlink messages. I guess
it's time for some kernel spelunking to see what might make EINVAL happen.
Thanks for helping me solve my original problem though, and for teaching
me a couple new things about netlink in the process.
- Matt Layher
On 02/23/2017 01:15 PM, Matt Layher wrote:
> Apologies, I missed the last part of your message, under "---".
>
> I had tried this early on, but netlink just seemed to ignore the PID I
> assigned to it. It assigns the first socket a PID of the process's
> PID, then seems to just pick one at random for any subsequent
> connections.
>
> Ever since then, I've just let netlink assign the PID on its own.
> However, now that you made me look back at my code, I think I may have
> found a bug in that area. I'll try a couple of things out and report
> back if that ends up fixing the problem.
>
> Thanks again!
> - Matt
>
>
> On 02/23/2017 01:01 PM, Matt Layher wrote:
>> Thanks for the reply. Yeah, I actually thought about goroutines not
>> mapping to threads right after I sent this, and I tried using
>> runtime.LockOSThread and runtime.UnlockOSThread immediately when a
>> goroutine spun up.
>>
>> Still encountered the same problem that way though, sadly. I'll
>> check out your link now, thanks!
>>
>> Also worth noting that I went ahead and tried an actual test with
>> genetlink: same scenario, but looking up family information for
>> nlctrl. Let that run in a loop for 10 minutes, and then 'go test'
>> sent SIGQUIT since it ran too long. No crashes there. I'm curious if
>> something about my "synthetic" test was making it act up.
>>
>> I'll keep looking into it. Thanks again for the reply.
>>
>> - Matt
>>
>>
>> On 02/23/2017 12:46 PM, Dan Williams wrote:
>>> On Thu, 2017-02-23 at 10:38 -0500, Matt Layher wrote:
>>>> Hi all,
>>>>
>>>> This question isn't directly related to libnl, but rather to netlink
>>>> and netlink sockets themselves. I wasn't sure where else to ask, but I
>>>> figured the folks on this list should have some good experience
>>>> working with netlink.
>>>>
>>>> I built a Go package (https://github.com/mdlayher/netlink) for
>>>> working with netlink sockets, but am seeing some occasional strange
>>>> behavior when attempting to use multiple sockets from the same
>>>> application.
>>>>
>>>> For whatever reason, netlink appears to occasionally send a reply
>>>> message to the wrong socket, when being called concurrently. I'm
>>>> opening 16 genetlink sockets and giving each socket its own "thread"
>>>> ("goroutine" in Go). I pick a sequence number at random for each
>>>> socket, and then increment it each time a message is sent.
>>> Be careful with concurrency, Go, and system calls.
>>>
>>> Go's concurrency model is not a strict 1:1 mapping between goroutines
>>> and OS threads. The Go scheduler will often mix and match goroutines
>>> between OS threads on the fly, and you can never guarantee which
>>> goroutine is running on which OS thread, even during the life of the
>>> goroutine.
>>>
>>> So don't assume that a goroutine will run on any specific OS thread at
>>> any point. Unless...
>>>
>>> You can use the LockOSThread()/UnlockOSThread() to ensure that a single
>>> goroutine is the only one on a given OS thread for its lifetime, and
>>> that no other goroutines will run on that OS thread. This of course
>>> kills parallelism since the Go scheduler can't run anything else in
>>> that OS thread.
>>>
>>> For more somewhat related info, see:
>>> https://github.com/containernetworking/cni/tree/master/pkg/ns
>>>
>>> I'm not sure why this might cause problems, but you mention threads and
>>> goroutines and that's a trigger :)
>>> ---
>>>
>>> Anyway, it looks like you're letting the kernel allocate nl_pid. As a
>>> test, what if you create a unique nl_pid for each Conn object before
>>> you bind it, to take the kernel out of the loop for debugging purposes?
>>>
>>> Dan
>>>
>>>> At this point, I send 10,000 messages from each socket with the
>>>> flags "request + acknowledge", so netlink will echo back the message
>>>> I sent to it. Again, before each message is sent, I increment the
>>>> internal sequence number of my socket wrapper.
>>>>
>>>> For whatever reason, sometimes I receive a reply back from netlink
>>>> with an unexpected sequence number. The sequence number often looks
>>>> like it was meant for another socket in the test, running in a
>>>> different thread.
>>>>
>>>> Is it safe to open multiple sockets to netlink (genetlink,
>>>> specifically) in the same application and use them concurrently in
>>>> this way? As far as I can tell, my code is free of race conditions in
>>>> user-space (verified using Go's race detector). I am not sharing a
>>>> single socket between multiple threads. I am simply sending and
>>>> receiving on multiple sockets at the same time, in independent
>>>> threads.
>>>>
>>>> It doesn't appear that libnl has any special "global lock", other
>>>> than the PID assignment map. I am no C expert, but I'm curious if
>>>> there is a workaround in libnl for making use of multiple sockets
>>>> concurrently, to ensure that messages are delivered properly to the
>>>> expected socket.
>>>>
>>>> Thanks for your time. I'd certainly appreciate any insight you all
>>>> may have on this matter.
>>>>
>>>> - Matt Layher
>>>>
>>>> _______________________________________________
>>>> libnl mailing list
>>>> libnl at lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/libnl
>>
>