[PATCH v2 1/9] driver core: Enable suppliers to implement fine grained sync_state support
Danilo Krummrich
dakr at kernel.org
Tue May 5 05:46:49 PDT 2026
On Tue May 5, 2026 at 1:12 PM CEST, Ulf Hansson wrote:
> On Wed, 22 Apr 2026 at 12:59, Danilo Krummrich <dakr at kernel.org> wrote:
>>
>> On Wed Apr 22, 2026 at 12:07 PM CEST, Ulf Hansson wrote:
>> > On Sat, 18 Apr 2026 at 13:23, Danilo Krummrich <dakr at kernel.org> wrote:
>> >> On Fri Apr 10, 2026 at 12:40 PM CEST, Ulf Hansson wrote:
>> >> > @@ -1126,6 +1128,9 @@ static void __device_links_queue_sync_state(struct device *dev,
>> >> > if (dev->state_synced)
>> >> > return;
>> >> >
>> >> > + if (dev->driver && dev->driver->queue_sync_state)
>> >> > + dev->driver->queue_sync_state(dev);
>> >>
>> >> This seems to be called without the device lock being held, which seems to allow
>> >> the queue_sync_state() callback to execute concurrently with remove(). This
>> >> opens the door for all kinds of UAF conditions in drivers.
>> >
>> > If that were the case, this whole function would be unsafe even before
>> > this change. I assume this isn't because of how the function is being
>> > called, but I may be wrong.
>>
>> This function does not issue any driver callbacks intentionally; the existing
>> sync_state() callback is deferred to device_links_flush_sync_list(), which is
>> called without the device_links_write_lock() held, but takes the device_lock()
>> to protect against other concurrent driver callbacks, such as remove().
>>
>> I.e. we can't take the device_lock() when the device_links_write_lock() is held,
>> as it would be prone to lock inversion.
>
> You are right, the device_lock() is needed in
> device_links_flush_sync_list() to protect against concurrent driver
> callbacks from being invoked, such as the ->remove().
>
> However, as part of the removal procedure in
> __device_release_driver(), we also need to account for device links.
> Hence we call device_links_busy() which takes the
> device_links_write_lock().
I don't think this is sufficient.
I wrapped my head around this about three weeks ago and I think there were
multiple cases where this falls apart. I think one case was:
1. Consumer binds, supplier is added to deferred_sync. Consumer unbinds,
link goes to AVAILABLE.
2. device_links_busy(supplier) takes and releases device_links_write_lock(),
finds no ACTIVE consumers, returns false -- drv->remove() starts without
device_links_write_lock() held.
3. device_links_supplier_sync_state_resume() takes device_links_write_lock(),
finds supplier still on deferred_sync -- device_links_driver_cleanup()
hasn't run yet, blocked on the lock. queue_sync_state() is called
concurrently with drv->remove().
But again, I think that's what I concluded three weeks ago and this state
machine is rather tricky.
That said, *even if* we could prove that this works in all cases, it would be
very subtle, pretty fragile and not by design.
The whole state machine around sync state is already rather complicated. So, if
we really want to make it an invariant that it is valid to call bus device
callbacks without holding the device lock from
__device_links_queue_sync_state(), I think this has to be carefully reflected by
the overall design making it *very explicit* how this invariant holds up in all
cases.
Otherwise, I'd be rather concerned that this becomes a source of subtle bugs.
More information about the linux-arm-kernel
mailing list