[PATCH v2] um: time-travel: fix time corruption

Fri Oct 27 07:05:19 PDT 2023

Am 26.10.2023 um 11:10 schrieb Johannes Berg:
> On Thu, 2023-10-26 at 08:49 +0000, Vincent Whitchurch wrote:
>> On Thu, 2023-10-26 at 09:38 +0200, Johannes Berg wrote:
>>> On Thu, 2023-10-26 at 07:23 +0000, Vincent Whitchurch wrote:
>>>>> @@ -839,9 +863,7 @@ static u64 timer_read(struct clocksource *cs)
>>>>>   		 */
>>>>>   		if (!irqs_disabled() && !in_interrupt() && !in_softirq() &&
>>>>>   		    !time_travel_ext_waiting)
>>>>> -			time_travel_update_time(time_travel_time +
>>>>> -						TIMER_MULTIPLIER,
>>>>> -						false);
>>>>> +			time_travel_update_time_rel(TIMER_MULTIPLIER);
>>>>>   		return time_travel_time / TIMER_MULTIPLIER;
>>>>>   	}
>>>>
>>>> The reason I hesitated with putting the whole of
>>>> time_travel_update_time() under local_irq_save() in my attempt was
>>>> because I didn't quite understand the reason for the !irqs_disabled()
>>>> condition here and the comment just above it about recursion and things
>>>> getting messed up.  If it's OK to disable interrupts as this patch does,
>>>> is the !irqs_disabled() condition valid?
>>>
>>> Hmm. I was going to say that's different, because it wants to only
>>> prevent us from doing this while we're *already* in IRQ context, and the
>>> bug you found is calling timer_read() not in IRQ context, but getting an
>>> event queued by the signal.
>>>
>>> But ... now that I think about it, I have a feeling that this was a
>>> workaround for the exact same problem, and I just didn't understand it
>>> at the time? I mean, recursing into our own processing is now impossible
>>> here after this patch - either we're running normally, or the interrupt
>>> cannot hit timer_read() in the middle, same as it cannot hit
>>> time_travel_handle_real_alarm() in the middle now.
>>>
>>> Removing that still seems to work with your test, but it's also not a
>>> good test for this, since there are no devices etc. that could have
>>> interrupts, not sure how to test it right now?
>>>
>>> Maybe I'll add a comment there saying this might no longer be needed?
>>
>> I tried removing the !irqs_disabled() check and that blew up pretty
>> quickly (below) when running the full roadtest suite.  It works fine
>> with your unmodified patch so no need to change the comment.
>>
> 
> Hah, OK. So maybe when/if I remember what happens there or can figure it
> out again, I can update the comment :)
As I said, I'm also preparing some patches, therefore my few bits about 
those problems:

- we noticed, that those backward running time is mostly created by the 
"unprotected" access to the variables of the time travel event list. 
 From the stack trace you see the problem even do not appear directly at 
timer_read but later. I think mostly this means the event list head was 
moved and then the interrupt happens and the pointer (or stored time) 
become outdated and the function (here the alarm_handler) continues with 
them.

- besides this, when you look into local_irq_save for um, this does not 
really deactivate the interrupts, but delay the processing and it adds a 
memory barrier. I think that could be one of the important consequences 
as changes to the event list are forced to be populated. But this is 
only a guess :-D

- I'm also not really convinced, that all accesses to the current time 
should be delayed. I made a patch with a heuristic, that only delays the 
time read, if it is read from userspace multiple times in a row. Since 
there are no "busy" loops in the kernel itself and only in a few (bad 
behaving) userspace programs, I think that helps to reduce the 
inaccuracy in simulation from well behaving userspace programs only 
reading the time once.

In this case, irqs_disabled more or less only tend to indicate, that the 
event list could be manipulated, and therefore the update_time call is a 
bad idea.

Benjamin