[PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

Tue Jun 23 07:55:21 EDT 2020

On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed, so support CLONE_NEWTIME with setns().
> 
> Cc: Thomas Gleixner <tglx at linutronix.de>
> Cc: Michael Kerrisk <mtk.manpages at gmail.com>
> Cc: Serge Hallyn <serge at hallyn.com>
> Cc: Dmitry Safonov <dima at arista.com>
> Cc: Andrei Vagin <avagin at gmail.com>
> Signed-off-by: Christian Brauner <christian.brauner at ubuntu.com>
> ---

Andrei,
Dmitry,

A little off-topic since it's not related to the patch here, but I've been
going through the current time namespace semantics and I just want to
confirm something with you:

Afaict, unshare(CLONE_NEWTIME) currently works similarly to
unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
but does _not_ change the {pid,time} namespace of the caller itself.
For pid namespaces that makes a lot of sense but I'm not completely
clear why you're doing this for time namespaces, especially since the
setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
pid namespace of the caller itself, it only changes it for its
children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
both the caller's and the children's time namespace is changed, i.e.
unshare(CLONE_NEWTIME) behaves differently from setns(CLONE_NEWTIME). Why?
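
A minimal sketch of how the "caller vs. children" split described above
can be checked for any namespace type, assuming a Linux /proc that
exposes the <name>_for_children links. The helper name
same_ns_for_children() is made up for illustration and is not part of
any kernel or libc API:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: compare /proc/self/ns/<name> with
 * /proc/self/ns/<name>_for_children.
 * Returns 1 if both links name the same namespace, 0 if they differ,
 * and -1 if either link cannot be read.
 */
static int same_ns_for_children(const char *name)
{
	char path[64], cur[64], child[64];
	ssize_t n;

	snprintf(path, sizeof(path), "/proc/self/ns/%s", name);
	n = readlink(path, cur, sizeof(cur) - 1);
	if (n < 0)
		return -1;
	cur[n] = '\0';		/* readlink() does not NUL-terminate */

	snprintf(path, sizeof(path), "/proc/self/ns/%s_for_children", name);
	n = readlink(path, child, sizeof(child) - 1);
	if (n < 0)
		return -1;
	child[n] = '\0';

	/* Both links read as e.g. "time:[4026531834]", so compare directly. */
	return strcmp(cur, child) == 0;
}
```

If the semantics are as described, same_ns_for_children("pid") should
return 0 after unshare(CLONE_NEWPID), and same_ns_for_children("time")
should return 0 after unshare(CLONE_NEWTIME) and 1 again after a
setns(CLONE_NEWTIME).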

This also has the consequence that the unshare(CLONE_NEWTIME) +
setns(CLONE_NEWTIME) sequence can be used to change the caller's time
namespace. Is this intended?
Here's some code where you can verify this (please excuse the awful
code I'm using to illustrate this):

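#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080	/* missing from older libc headers */
#endif

/* readlink() does not NUL-terminate, so terminate the buffer here */
static void ns_link(const char *path, char *buf, size_t len)
{
	ssize_t n = readlink(path, buf, len - 1);

	buf[n < 0 ? 0 : n] = '\0';
}

int main(int argc, char *argv[])
{
	char buf1[4096], buf2[4096];
	int fd;

	if (unshare(CLONE_NEWTIME))
		exit(1);

	/* the caller's own time namespace, which unshare() left unchanged */
	fd = open("/proc/self/ns/time", O_RDONLY);
	if (fd < 0)
		exit(2);

	ns_link("/proc/self/ns/time", buf1, sizeof(buf1));
	ns_link("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
	printf("unshare(CLONE_NEWTIME):		time(%s) ~= time_for_children(%s)\n", buf1, buf2);

	if (setns(fd, CLONE_NEWTIME))
		exit(3);

	ns_link("/proc/self/ns/time", buf1, sizeof(buf1));
	ns_link("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
	printf("setns(self, CLONE_NEWTIME):	time(%s) == time_for_children(%s)\n", buf1, buf2);

	exit(EXIT_SUCCESS);
}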

which gives:

root at f2-vm:/# ./test
unshare(CLONE_NEWTIME):		time(time:[4026531834]) ~= time_for_children(time:[4026532366])
setns(self, CLONE_NEWTIME):	time(time:[4026531834]) == time_for_children(time:[4026531834])

Why is unshare(CLONE_NEWTIME) blocked from changing the caller's time
namespace when setns(CLONE_NEWTIME) is allowed to do this?

Christian