problem with default local port(nl_pid) when netlink used both via libnl and directly in same application

Thomas Graf tgraf at infradead.org
Mon May 7 08:53:31 EDT 2012


On Mon, May 07, 2012 at 05:05:32AM -0400, Laine Stump wrote:
> I've just diagnosed a problem in libvirt that traces back to libnl's
> unilateral decision to use getpid() of the calling process as the
> default "local port" (nl_pid) for the first netlink socket it creates
> for each process.
> 
> The problem is that this is also the default value used if a piece of
> code running in that process uses direct system calls to create/bind a
> netlink socket. In our example, this was the result of calling glibc's
> getaddrinfo() function, so we weren't even aware that it was happening.
> Even though getaddrinfo() only keeps its netlink socket connected for a
> short period, if that is running in a separate thread from the thread
> that calls nl_handle_alloc()/nl_connect(), the result will be that the
> bind() in nl_connect() fails with EADDRINUSE.
> 
> Although we're working around the problem in libvirt, I thought I should
> bring it up here as well, since this same problem could bite any other
> application that has similar dual uses of netlink both directly and
> via libnl (in many cases without even realizing it).

Thanks a lot for the notification Laine!
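
For anyone who wants to see the collision in isolation, here is a minimal
sketch that needs no libnl at all. It assumes Linux and NETLINK_ROUTE; the
helper names (open_bound_netlink, demo_port_collision) are made up for
illustration and are not part of libnl or glibc. The first socket stands in
for a direct user such as getaddrinfo() and lets the kernel pick the port,
which is normally the process ID. The second socket stands in for libnl's
default and asks for getpid() explicitly.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Hypothetical helper: open a NETLINK_ROUTE socket and bind it to `port`.
 * port == 0 asks the kernel to choose a port.
 * Returns the fd, or -1 with errno set. */
int open_bound_netlink(unsigned int port)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_pid = port;

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved = errno;          /* keep bind()'s errno across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Returns 0 if the second bind succeeds, the errno of the second bind
 * if it fails, or -1 if the first socket could not be set up. */
int demo_port_collision(void)
{
    int first, second, err;

    /* "Direct" user: kernel-chosen port, which tries the process ID first. */
    first = open_bound_netlink(0);
    if (first < 0)
        return -1;

    /* libnl-style default: explicitly request getpid() as the local port. */
    second = open_bound_netlink((unsigned int)getpid());
    err = (second < 0) ? errno : 0;

    if (second >= 0)
        close(second);
    close(first);
    return err;
}
```

Calling demo_port_collision() while the first socket is still open is the
single-threaded equivalent of the race described above; the second bind()
is the one that reports EADDRINUSE.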

> Note that during the discussion (duplicated below for convenience) I
> point out that, while it avoids the collision, simply modifying libnl to
> skip the first local port is not a good solution in the general case
> because existing applications that use libnl may be depending on that
> behavior; as a matter of fact that was the case with communication
> between libvirt and lldpad.
> 
> Also note that changing nl_connect() to retry the bind with a different
> port is also not a full solution, both for the above reason as well as
> because libnl allows the application to retrieve the local port with
> nl_socket_get_local_port() before nl_connect() is called, so an
> application may have already stored the local port information, and
> silently changing it during nl_connect() would lead to inconsistency
> between what the application believes and reality.

This is exactly the root of the problem. Early users of netlink assumed
that the local port always equals the pid, and we have had to maintain
backwards compatibility ever since. I can't think of anything we can do
that wouldn't create as many problems as it would solve, so I guess we
are stuck with this unless someone comes up with a smart idea.