[PATCH] workqueue: Fix memory ordering race in queue_work*()

Tue Aug 16 09:41:52 PDT 2022

On Tue, Aug 16, 2022 at 6:42 AM Will Deacon <will at kernel.org> wrote:
>
> > Will?
>
> Right, this looks like it's all my fault, so sorry about that.

It's interesting how this bug has existed for basically four years.

I suspect that the thing is fairly hard to hit in practice,
particularly on the kinds of devices most common (ie limited OoO
windows on most phone SoCs).

Together with phone loads probably not generally being all that
exciting from a kernel standpoint (a lot of driver work, probably not
a lot of workqueue stuff).

>   1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
>      regardless of the value which is read from memory. The lock/unlock
>      flavours can remain as-is.

I've applied Hector's "locking/atomic: Make test_and_*_bit() ordered
on failure".

>   2. Fix the documentation

That patch had a bit of a fix, but it is probably worth looking at more.

>   3. Figure out what to do about architectures building atomics out of
>      spinlocks (probably ok as lock+unlock == full barrier there?)

Yeah, I wonder how we should describe the memory ordering here. We've
always punted on it, saying it's a "memory barrier", but that has been
partly a "look, that's the traditional x86 model".

And the traditional x86 model really sees the locked RMW operations as
being one single operation - in ways that most other architectures
don't.

So on x86, it's more than "this implies a memory barrier" - it's also
that there is no visible load-modify-store sequence where you can
start asking "what about the ordering of the _load_ wrt the preceding
memory operations".

That makes the x86 behavior very easy to think about, but means that
when you have bitops implemented other ways, you have questions that
are much harder to answer.

So in a Lock+read+op+write+unlock sequence, you can have preceding
operations move into the locked region, and mix with the read+op+write
side.

>   4. Accept my sincerest apologies for the mess!

I don't think you were the source of the mess. The *source* of the
mess is that we've always had very messy rules about the bitops in
particular.

I think the *code* is fixed (at least wrt the generic implementation,
I think the other models are up for discussion), but I think the real
issue is how we should describe the requirements.

So I think we have at least three cases we need to deal with:

 (a) the people who just want atomics, and don't care about any
ordering. They are bound to exist.

 (b) the people who want "acquire"-like semantics. I think the
existing test_and_set_bit_lock() is fine

 (c) the people who want "release"-like semantics, where the
"test-and-set-bit" is for announcing "I'm done, was I the first one?".

 (d) the full barrier case

I think we currently actively have (b) and (d), but it's possible that
the workqueue case really is only (c).

And I think the spinlock implementation really most naturally has that
(c) semantics - you don't necessarily get some theoretical full memory
barrier, but you *do* get those "handover" semantics.

Maybe we never really want or need (d) at all?

So I get the feeling that we really shouldn't specify
"test_and_set_bit()" in terms of memory barriers at all. I *think* it
would be more natural to describe them in terms of "handover" (ie
acquire/release), and then the spinlock semantics are obviously fine.

So I htink the code problem is easy, I think the real problem here has
always been bad documentation, and it would be really good to clarify
that.

Comments?

              Linus