[RFC] arm64: Enforce gettimeofday vdso structure read ordering
bdegraaf at codeaurora.org
Mon Aug 22 12:32:44 PDT 2016
On 2016-08-22 14:56, Mark Rutland wrote:
> Hi Brent,
>
> Thanks for the thorough reply. Comments inline below.
>
> On Mon, Aug 22, 2016 at 01:32:47PM -0400, bdegraaf at codeaurora.org
> wrote:
>> On 2016-08-22 07:37, Mark Rutland wrote:
>> >* What problem does this patch address?
>>
>> Initially, I set out to fix a control-flow problem, as I originally
>> wrote this against the code prior to the refactoring of commit
>> b33f491f5a9aaf171b7de0f905362eb0314af478, back when there was still a
>> do_get_tspec subroutine. That one used a dmb to order the
>> data accesses prior to the isb/mrs sequence that read the virtual
>> counter. Our Senior Director, who has done extensive work with the
>> ARM CPU and is intimately familiar with the instruction set, indicated
>> that the dmb which was used in that code was not enough to ensure
>> ordering between the loads from the structure and the read of the
>> virtual counter. Since that code had no control flow (e.g., some
>> conditional governing the code's progress) prior to the isb, he
>> suggested that some form of dsb would be required to ensure proper
>> ordering between the loads from the vdso structure and the mrs read
>> of the virtual counter.
>
> Ok. So if I've parsed the above correctly, the fear was that an ISB was
> insufficient to guarantee the ordering of prior loads w.r.t. the
> subsequent MRS, and a control dependency between the two was necessary,
> in addition to the ISB.
>
Exactly.
>> I went to the latest ARMv8 ARM at that time and found a concrete
>> example of how the code should be structured to ensure an ordered
>> read of the virtual counter. In the most recent copy to which I have
>> access (ARM DDI 0487A.j), that code is given on page D6-1871, under
>> section D6.2.2. I moved the sequence count check immediately prior
>> to the isb to satisfy the particular ordering requirements in that
>> code.
>
> My reading of that example is that the control dependency alone was
> insufficient (given speculation), and the ISB provided the necessary
> ordering between the signal variable being updated and the MRS. To me,
> the example does not imply that both are required in all cases, only
> that a control dependency alone is insufficient.
>
> Per the description of ISB on page B2-87 of ARM DDI 0487A.j, my
> understanding (which may be flawed), is that the instructions prior to
> the ISB must be completed before the subsequent instructions are
> fetched+issued, and hence the MRS should be (locally) ordered w.r.t.
> the loads.
>
Reportedly, saying the instructions are completed isn't exactly the same
as saying the loads have been accessed (I brought up this same point to
him when we discussed it). If a control dependency is not present prior
to the ISB, he said we would have to add a DSB to ensure ordering.
Not wanting to potentially slow things down across multiple cores with
a DSB, the control dependency is the lighter-weight solution. I would
not have known that the control dependency was available in this
situation had the ARM not mentioned it.
>> When the refactored code went in recently, however, the seqcnt_read
>> prior to the isb changed to a seqcnt_check, which addressed that
>> ordering requirement.
>
> Have you seen an issue in practice prior to this? If so, we may need to
> go digging further into this, and consider stable kernels.
>
I had not seen a problem. We were looking at the speed of the code to
see if anything could be helped when the Sr. Director noticed the
correctness problem. After I applied the fix, results given by
gettimeofday got tighter: averages that varied by up to 2 nsec before
now vary by only one or two hundredths of a nanosecond.
>> >* Is this a bug fix? If so, what problem can be seen currently?
>>
>> The most obvious problem with the existing code is where the timezone
>> data gets loaded: after sequence counts have been checked and
>> rechecked, completely outside the area of the code protected by the
>> sequence counter. While I realize that timezone code does not change
>> frequently, this is still a problem as the routine explicitly reads
>> data that could be in the process of being updated.
>
> I take that you specifically mean the following line in
> __kernel_gettimeofday, which occurs after the usual seqcnt_acquire +
> seqcnt_check sequence:
>
> ldp w4, w5, [vdso_data, #VDSO_TZ_MINWEST]
>
> I see that the writer side for the timezone data is also not protected,
> since commit bdba0051ebcb3c63 ("arm64: vdso: remove broken, redundant
> sequence counting for timezones"). Following comments in commits I
> found x86 commit 6c260d586343f7f7 ("x86: vdso: Remove bogus locking in
> update_vsyscall_tz()").
>
> Per the x86 commit, this is not atomic in the usual syscall path, and
> thus it is a cross-architecture property that timezone updates are not
> atomic, and that reordering of accesses may occur around a change of
> timezone. If we want to tighten up the VDSO, we'll also need to tighten
> up the syscall.
>
> [...]
>
Ugh. I had not looked at the writer. That would explain why the comments
refer to it as "whacky tz stuff."
>> The second problem is that the timing between the reading of the vdso
>> data and the virtual counter is treated as secondary in the code, as a
>> few register manipulations using the structure data are performed
>> prior to reading the virtual counter. While the cpu itself is free to
>> reorder these shifts and loads somewhat, depending on the
>> implementation, the read of the virtual counter should be performed as
>> close as possible to the time the vdso data itself is read to minimize
>> variability. As these manipulations and loads are not dependent on
>> the result of the mrs read, putting them after the virtual counter
>> isb/mrs sequence allows these independent register manipulations to
>> issue while the mrs is still in process.
>
> This is rather dependent on the microarchitecture. Do we have any
> numbers as to the variability?
>
Other than my strictly anecdotal evidence on my test system above, I do
not. But keeping those manipulations on the "mrs" side of the ISB seems
like a good idea because, depending on the microarchitecture, that mrs
can take a fairly long time. (I've heard of values as high as 90 nsec).
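To make the intent concrete, the arrangement is simply this (an illustrative
sketch only; the register numbers roughly mirror the patch but are not meant
to be exact):

          isb                             // all vdso_data loads complete before here
          mrs     x2, cntvct_el0          // kick off the (possibly slow) counter read
          lsl     x13, x13, x12           // shifts that consume only vdso_data values
          lsl     x14, x14, x12           //   can issue while the mrs is still in
                                          //   flight, since they do not use x2
          sub     x2, x2, x10             // the mrs result is consumed only here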
>> >* Is this an optimisation? If so, how much of an improvement can be
>> > seen?
>>
>> Optimization was not the main goal of this patch, yet performance did
>> improve on my target, with the average time improving marginally
>> (350 nsec after vs 360 nsec before the change), compared to the
>> refactored code. In fact, performance is now slightly better (by a
>> minuscule 2 nsec or so) than the original code from before the
>> refactor, which had hurt performance.
>
> Ok.
>
>> >>+ .macro seqdata_acquire fallback, tzonly=NO_TZ, skipvcnt=0, getdata
>> >>+9999: ldar seqcnt, [vdso_data, #VDSO_TB_SEQ_COUNT]
>> >>+8888: tbnz seqcnt, #0, 9999b
>> >> ldr w_tmp, [vdso_data, #VDSO_USE_SYSCALL]
>> >>- cbnz w_tmp, \fail
>> >>+ cbnz w_tmp, \fallback
>> >>+ \getdata
>> >>+ dmb ishld /* No loads from vdso_data after this point */
>> >
>> >What ordering guarantee is the DMB attempting to provide? Given we have
>> >the acquire, I assume some prior load, but I couldn't figure out what
>> >specifically.
>>
>> That barrier specifically ensures that loads performed by the
>> "getdata" sequence do not get accessed after the subsequent ldar check
>> of the sequence counter, since, as you know, ldar may allow loads that
>> come before it in program order to be accessed after it in much the
>> same way as stlr may allow stores that come after it to be accessed
>> before it.
>
> Ok. I wonder how the performance of a DMB ISHLD; LDAR compares to that
> of a DMB ISH LD, and whether that's actually a win across
> microarchitectures.
>
Keep in mind that in the "spin" case (at label 9999), it spins on a
single ldar, which, due to its one-way nature, is bound to be lighter
weight. In my experience, however, trying to nail down average timing
for a single barrier is difficult.
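Coming back to the dmb ishld for a moment: without it, the architecture
would permit an effective ordering like the one below (a hand-waved
illustration of observed order, not code anyone would write):

          ldar    seqcnt, [vdso_data, #VDSO_TB_SEQ_COUNT]
                                          // re-check observes count N...
          ldr     x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
                                          // ...but a program-order-earlier getdata
                                          //   load is satisfied late, after the
                                          //   writer has already moved past N

so stale data could slip past an apparently clean sequence check. The
dmb ishld closes that window without making the spin itself any heavier.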
>> >>+ mov w9, seqcnt
>> >>+ ldar seqcnt, [vdso_data, #VDSO_TB_SEQ_COUNT]
>> >
>> >Usually, acquire operations pair with a release operation elsewhere.
>> >What does this pair with?
>>
>> It was for that reason that I introduced stlr's into the writer code,
>> but the barrier provided by stlr was insufficient for my purposes, as
>> Will pointed out. There is no requirement or even suggestion in the
>> ARM that every use of ldar needs to be paired with stlr's.
>
> Sure. I guess I was asking which updater this pairs with, and,
> having dug, I see it's just update_vsyscall().
>
>> >>+ cmp w9, seqcnt
>> >>+ bne 8888b /* Do not needlessly repeat ldar and its implicit barrier */
>> >>+ .if (\tzonly) != NO_TZ
>> >>+ cbz x0, \tzonly
>> >>+ .endif
>> >>+ .if (\skipvcnt) == 0
>> >>+ isb
>> >>+ mrs x_tmp, cntvct_el0
>> >>+ .endif
>> >> .endm
>> >
>> >All this conditional code makes the callers somewhat painful to read.
>> >
>> >It might be nicer to have this explicit in the callers that require it
>> >rather than conditional in the macro.
>>
>> The general use-case of the acquire sequence made this the cleanest
>> safe implementation I could come up with. If this isb/mrs sequence is
>> split out into each clock handler, it would serve to obscure the
>> relationship between the control-flow dependency (in this case, the
>> "bne 8888b") and the isb. Keeping this acquire sequence intact helps
>> to ensure that future modifications adhere to the correct sequence.
>> Note that if the caller specifies neither option, the default is to
>> leave these items in place.
>
> As above, I'm not sure that the control dependency is key. Even if so,
> the logical sequence is:
>
> seqdata_acquire
> isb
> mrs
>
> I can't fathom why someone would move the ISB (and/or MRS) before the
> seqcnt_acquire.
>
This can be separated out, but it'll be repeated in a few places.
The only place the tz code is used is the gettimeofday logic itself.
>> >> .macro get_nsec_per_sec res
>> >>@@ -64,9 +70,6 @@ x_tmp .req x8
>> >> * shift.
>> >> */
>> >> .macro get_clock_shifted_nsec res, cycle_last, mult
>> >>- /* Read the virtual counter. */
>> >>- isb
>> >>- mrs x_tmp, cntvct_el0
>> >> /* Calculate cycle delta and convert to ns. */
>> >> sub \res, x_tmp, \cycle_last
>> >> /* We can only guarantee 56 bits of precision. */
>> >>@@ -137,17 +140,12 @@ x_tmp .req x8
>> >> ENTRY(__kernel_gettimeofday)
>> >> .cfi_startproc
>> >> adr vdso_data, _vdso_data
>> >>- /* If tv is NULL, skip to the timezone code. */
>> >>- cbz x0, 2f
>> >>-
>> >>- /* Compute the time of day. */
>> >>-1: seqcnt_acquire
>> >>- syscall_check fail=4f
>> >>- ldr x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
>> >>- /* w11 = cs_mono_mult, w12 = cs_shift */
>> >>- ldp w11, w12, [vdso_data, #VDSO_CS_MONO_MULT]
>> >>- ldp x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
>> >>- seqcnt_check fail=1b
>> >>+ seqdata_acquire fallback=4f tzonly=2f getdata=__stringify(\
>> >>+ ldr x10, [vdso_data, #VDSO_CS_CYCLE_LAST];\
>> >>+ /* w11 = cs_mono_mult, w12 = cs_shift */;\
>> >>+ ldp w11, w12, [vdso_data, #VDSO_CS_MONO_MULT];\
>> >>+ ldp x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC];\
>> >>+ ldp w4, w5, [vdso_data, #VDSO_TZ_MINWEST])
>> >
>> >Why do we need the stringify? Is that just so we can pass the code as a
>> >macro parameter? If so, it really doesn't look like the way to go...
>> >
>> >This is unfortunately painful to read.
>> >
>>
>> I implemented it this way to remain as similar as possible to the
>> refactored code that was recently merged, while at the same time
>> ensuring that, as I explained above, the reads of the vdso_data
>> performed by each clock type are completely contained within a set of
>> proper sequence count checks. That they were not contained led to
>> problems such as the improper handling of the timezone data before,
>> and it ensures that the isb follows the sequence check closely. This
>> use is not entirely dissimilar to other code which uses stringify
>> currently present in the arm64 kernel code which passes code as a
>> parameter. See, for example, arch/arm64/lib/copy_*_user.S.
>> All this said, however, I was never thrilled about going the stringify
>> route, but it was more readable than any of the other variants I could
>> come up with (and far better than adding the extra ".if's" in the
>> macro).
>> Do you happen to have a better suggestion?
>
> I think that:
>
> ACQUIRE, blah, blah, blah
> < long >
> < code >
> < sequence >
> CONDITION_FAIL, blah, blah, blah
>
> Is clearer than dropping the code sequence into a macro parameter, even
> if there's some implicit dependency between the ACQUIRE and
> CONDITION_FAIL macros.
>
> Thanks,
> Mark.
I'll see what I can do to split this without spinning on the barrier.
Yes, it's only one spin, but if there's any clean way I can do that
I will. If the ldar result's bit 0 is set, it needs to reload anyway.
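As a first rough cut, I am picturing something along these lines for the
gettimeofday path (the macro names, arguments, and labels below are
placeholders, not a final proposal, and I have left out the tz-only early
exit for brevity):

  1:      seqdata_acquire fallback=4f     // ldar + odd-count spin + syscall check
          ldr     x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
          /* w11 = cs_mono_mult, w12 = cs_shift */
          ldp     w11, w12, [vdso_data, #VDSO_CS_MONO_MULT]
          ldp     x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
          ldp     w4, w5, [vdso_data, #VDSO_TZ_MINWEST]
          seqdata_check retry=1b          // dmb ishld + ldar + cmp + branch; I still
                                          //   need to pick the retry target so the
                                          //   first ldar is not needlessly repeated
          isb                             // explicit at the call site, as you suggest
          mrs     x_tmp, cntvct_el0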