[PATCH] um: insert scheduler ticks when userspace does not yield

Tue Sep 24 03:46:21 PDT 2024

On Mon, 2024-09-23 at 23:50 +0200, Benjamin Beichler wrote:
> Hi,
> 
> Am 23.09.2024 um 16:48 schrieb Benjamin Berg:
> > > Actually, I think timeouts are no problem if we can ensure that a
> > > timeout is never rounded down to 0. Mostly, a direct input of 0 has a
> > > special meaning, or provokes wrong behavior from the user space
> > > program in the first place.
> > I don't think that is a problem. The kernel should guarantee that a
> > timeout never fires too early.
> > 
> > I believe in the case of the linked python code, the timeout fires at
> > exactly the correct time. And then the python code (incorrectly)
> > detects that the timeout has not passed and tries to "select" again
> > with a timeout of exactly zero.
> > 
> > Really, that implementation is just buggy in subtle ways. It could
> > probably just trust the kernel to not wake up early. And, if it does
> > check whether the timeout has passed, then it should just accept the
> > exact time.
> 
> Maybe I'm stating the obvious here, but I had the impression this
> code was written this way to handle interruptions by signals, not to
> doubt the time accuracy. Possibly I'm totally wrong, but it seems quite
> elegant to simply use the time here and avoid the dance of masking
> signals or checking for interruptions etc.

Yeah, you are right. I wasn't sure what python does in case of EINTR.
But it does look like it'll stop the select in that case after handling
the signal.

> I believe this code was written with in mind that time() will advance,
> so this will never be an endless loop; even the corner case of a
> timeout of 0 would be covered by this.

Dunno, it feels to me like someone just didn't think much about whether
it should be a "<" or "<=" operator there. Simply changing the
    if timeout < 0: 
to
    if timeout <= 0:
will resolve the problem and is more correct in my view. A specific
"time" is valid in the interval from the clock tick until the next
tick. So the only sensible thing is to assume that you are somewhere
within the interval *after* the tick, which means the timeout has
passed already if it is exactly zero.

Trying to do a zero length sleep to wait for the next clock tick does
not make sense. Especially as a zero length sleep is defined to not
sleep.
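
To illustrate, here is a minimal sketch of such a retry loop. It is not
the linked python code: FrozenClock, wait_with_retry and the max_calls
guard are made up for illustration, and the clock is assumed to advance
only while sleeping, as it would in time-travel mode.

```python
class FrozenClock:
    """Toy clock (assumption): time only moves when someone sleeps."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def select(self, timeout):
        # Model of a blocking wait that wakes exactly on time, never early.
        # A zero timeout returns immediately and does not advance the clock.
        self.now += timeout
        return []  # nothing became ready


def wait_with_retry(clock, timeout, strict, max_calls=1000):
    """Retry loop in the style discussed above (hypothetical helper).

    strict=True checks `timeout < 0`, strict=False checks `timeout <= 0`.
    Returns the number of select() calls made, or None if the max_calls
    guard tripped, i.e. the loop would have spun forever.
    """
    deadline = clock.time() + timeout
    calls = 0
    while calls < max_calls:
        clock.select(timeout)
        calls += 1
        # Recompute the remaining time, as a signal-tolerant loop would.
        timeout = deadline - clock.time()
        expired = timeout < 0 if strict else timeout <= 0
        if expired:
            return calls
    return None
```

With strict=True the remaining timeout is exactly zero after the first
wakeup, so the loop keeps issuing zero length selects that never move
the clock. With strict=False it returns after a single select.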

> > > Since time-travel mode has a very limited niche, I would not try to
> > > prevent every possible dumb behavior that bad user space programs could
> > > have. I think busy-waiting on a system clock advancement is not the best
> > > style, but acceptable.
> > > 
> > > So my list was:
> > > 
> > > sys_getitimer
> > > sys_gettimeofday
> > > sys_time
> > > sys_timer_gettime
> > > sys_clock_gettime
> > > sys_timerfd_gettime
> > > 
> > > While overthinking it, I see the possibility to read the access
> > > timestamps of a file to create an endless loop, so maybe the stat
> > > syscalls may be included, although this makes me a bit uncomfortable
> > > again. I tend to say, this "bad" behavior of asking the same information
> > > over and over again, should only be punished, if it happens multiple times.
> > > 
> > > I was thinking about storing the PID of a busy-looping process, and
> > > only increasing time if the same PID is "suspicious". However, this
> > > "hack" becomes more and more costly, which on the other hand is not
> > > important for time-travel mode.
> > Maybe a stupid question, but aren't we overthinking this in general?
> > 
> > While I think that Johannes' solution to make reading the time cost
> > time is kind of ingenious, I really wonder how much of an issue this
> > actually is. Because if this is just a few userspace applications and
> > libraries misbehaving, then we might as well fix the issue there
> > instead of doing anything special in UML.
> 
> Your point is right, and such bugs may be fixed in user space. On the
> other hand, what about software we can't or don't want to fix, which
> simply works in the wild? For my future use cases, I will run code that
> I'm not able to compile myself. I would even consider having a runtime
> switch to change the behavior of this hack, to reduce the overhead in
> simulations that behave nicely, but still have a quick workaround for
> misbehaving code.

Sure. But we could still decide to not support that in the upstream
kernel. How you solve the problem in your application would be up to
you. You can apply a simple patch or find another solution in
userspace.

> And sorry for repeating myself, but I believe, that busy waiting on an 
> increasing timer value is not the best style, but considered okay/normal 
> for some use cases. So I think it would be helpful to be able to execute 
> such user space code.

Right, it is just that I have not actually seen anyone wanting to
busy wait on the clock. The python example explicitly tries to sleep
and just gets the rounding wrong.
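
For reference, a toy model of the kind of busy wait in question, and of
how making each clock read cost time would unblock it. This is only a
sketch under stated assumptions: ToyClock, read_cost_ns and max_reads
are hypothetical names, and none of this is UML code.

```python
class ToyClock:
    """Toy simulated clock (assumption): it never advances on its own."""

    def __init__(self, read_cost_ns=0):
        self.now_ns = 0
        # Hypothetical charge applied to every read of the clock.
        self.read_cost_ns = read_cost_ns

    def gettime_ns(self):
        # "Reading the time costs time": advance before returning.
        self.now_ns += self.read_cost_ns
        return self.now_ns


def busy_wait(clock, duration_ns, max_reads=10_000):
    """Spin until the clock passes a deadline.

    Returns the number of clock reads needed, or None if the max_reads
    guard tripped because the clock never moved.
    """
    deadline = clock.gettime_ns() + duration_ns
    reads = 1
    while reads < max_reads:
        reads += 1
        if clock.gettime_ns() >= deadline:
            return reads
    return None
```

busy_wait(ToyClock(0), 1000) never sees the clock move and gives up,
while busy_wait(ToyClock(100), 1000) terminates after a bounded number
of reads.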

> But I want to bring in another idea: Could we use an eBPF program to
> dynamically hook into syscalls and do a timetravel_update or something
> similar? Actually, I do not know whether eBPF works normally in UM, but
> that way it would be flexible and would move the dirty hacks into small
> portions outside the kernel. From what I understand, we would need to
> add an eBPF-callable wrapper for the time travel update function, right?

Hmm, possibly. I am not familiar with what is possible to do when
tracing syscalls using eBPF.

That said, I wonder if we should be inserting a clock_nanosleep()
conceptually. If you want to force a process to continue running after
the next clock tick, then you might want to schedule all other runnable
tasks *before* letting the clock tick.

> > > > One neat side effect is that if reading time does not actually cost
> > > > time, then we could implement clock_gettime in the VDSO.
> > > That would exactly not work, because of my comment from before.
> > Of course. It is just that I have always in the back of my mind that
> > syscalls and pagefaults (including minor faults) are really expensive
> > in UML. So if the hack is moved elsewhere then implementing
> > clock_gettime in the vDSO could be an easy win to speed up the
> > simulation.
> 
> Mhh, I only had a quick look into "arch/x86/um/vdso/um_vdso.c" and from
> my understanding, currently every vDSO call is converted into a syscall
> on the host. So we would need much more code to use the time travel
> clock here, wouldn't we? Of course, my proposed eBPF hook would not
> work here either...

Yeah, but I don't expect it to be that complicated. If the clock
doesn't actually change, then it would be trivial to just return a
constant value.

It gets more complicated if reading time should take time. But even
then, we can probably hide some (writable) bookkeeping data on the stub
page (as the vDSO data should be read-only). And outside of time-travel
mode, a SECCOMP execution model would allow us to do direct host syscalls.

Anyway, I suppose vDSO improvements are not necessarily a short term
project.

Benjamin