[PATCH v2] um: time-travel: fix time corruption
Benjamin Beichler
Benjamin.Beichler at uni-rostock.de
Fri Oct 27 07:05:19 PDT 2023
Am 26.10.2023 um 11:10 schrieb Johannes Berg:
> On Thu, 2023-10-26 at 08:49 +0000, Vincent Whitchurch wrote:
>> On Thu, 2023-10-26 at 09:38 +0200, Johannes Berg wrote:
>>> On Thu, 2023-10-26 at 07:23 +0000, Vincent Whitchurch wrote:
>>>>> @@ -839,9 +863,7 @@ static u64 timer_read(struct clocksource *cs)
>>>>> */
>>>>> if (!irqs_disabled() && !in_interrupt() && !in_softirq() &&
>>>>> !time_travel_ext_waiting)
>>>>> - time_travel_update_time(time_travel_time +
>>>>> - TIMER_MULTIPLIER,
>>>>> - false);
>>>>> + time_travel_update_time_rel(TIMER_MULTIPLIER);
>>>>> return time_travel_time / TIMER_MULTIPLIER;
>>>>> }
>>>>
>>>> The reason I hesitated with putting the whole of
>>>> time_travel_update_time() under local_irq_save() in my attempt was
>>>> because I didn't quite understand the reason for the !irqs_disabled()
>>>> condition here and the comment just above it about recursion and things
>>>> getting messed up. If it's OK to disable interrupts as this patch does,
>>>> is the !irqs_disabled() condition valid?
>>>
>>> Hmm. I was going to say that's different, because it wants to only
>>> prevent us from doing this while we're *already* in IRQ context, and the
>>> bug you found is calling timer_read() not in IRQ context, but getting an
>>> event queued by the signal.
>>>
>>> But ... now that I think about it, I have a feeling that this was a
>>> workaround for the exact same problem, and I just didn't understand it
>>> at the time? I mean, recursing into our own processing is now impossible
>>> here after this patch - either we're running normally, or the interrupt
>>> cannot hit timer_read() in the middle, same as it cannot hit
>>> time_travel_handle_real_alarm() in the middle now.
>>>
>>> Removing that still seems to work with your test, but it's also not a
>>> good test for this, since there are no devices etc. that could have
>>> interrupts, not sure how to test it right now?
>>>
>>> Maybe I'll add a comment there saying this might no longer be needed?
>>
>> I tried removing the !irqs_disabled() check and that blew up pretty
>> quickly (below) when running the full roadtest suite. It works fine
>> with your unmodified patch so no need to change the comment.
>>
>
> Hah, OK. So maybe when/if I remember what happens there or can figure it
> out again, I can update the comment :)
As I said, I'm also preparing some patches, therefore my few bits about
those problems:
- we noticed, that those backward running time is mostly created by the
"unprotected" access to the variables of the time travel event list.
From the stack trace you see the problem even do not appear directly at
timer_read but later. I think mostly this means the event list head was
moved and then the interrupt happens and the pointer (or stored time)
become outdated and the function (here the alarm_handler) continues with
them.
- besides this, when you look into local_irq_save for um, this does not
really deactivate the interrupts, but delay the processing and it adds a
memory barrier. I think that could be one of the important consequences
as changes to the event list are forced to be populated. But this is
only a guess :-D
- I'm also not really convinced, that all accesses to the current time
should be delayed. I made a patch with a heuristic, that only delays the
time read, if it is read from userspace multiple times in a row. Since
there are no "busy" loops in the kernel itself and only in a few (bad
behaving) userspace programs, I think that helps to reduce the
inaccuracy in simulation from well behaving userspace programs only
reading the time once.
In this case, irqs_disabled more or less only tend to indicate, that the
event list could be manipulated, and therefore the update_time call is a
bad idea.
Benjamin
More information about the linux-um
mailing list